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Abstract 

This work studies the recursive robust principal components' analysis (PCA) problem. Here, "robust" refers to robustness to 
£f} . both independent and correlated sparse outliers. If the outlier is the signal-of-interest, this problem can be interpreted as one of 

recursively recovering a time sequence of sparse vectors, St, in the presence of large but structured noise, Lt. The structure that 
we assume on Lt is that Lt is dense and lies in a low dimensional subspace that is either fixed or changes "slowly enough". A 
key application where this problem occurs is in video surveillance where the goal is to separate a slowly changing background 
(Lt) from moving foreground objects (St) on-the-fly. To solve the above problem, we introduce a novel solution called Recursive 



o 

(N 



pT_^ ■ Projected CS (ReProCS). Under mild assumptions, we show that, with high probability (w.h.p.), ReProCS can exactly recover the 

ly-^ , support set of St at all times; and the reconstruction errors of both St and L t are upper bounded by a time-invariant and small 

value at all times. We also show how the algorithm and its guarantees extend to the undersampled measurements' case. 

Keywords: robust PCA, sparse recovery, compressive sensing 
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q ■ I. Introduction 

Principal Components' Analysis (PCA) is a widely used dimension reduction technique that finds a small number of 

^ | orthogonal basis vectors, called principal components (PCs), along which most of the variability of the dataset lies. It is 

well known that PCA is sensitive to outliers. Accurately computing the PCs in the presence of outliers is called robust PCA 

' 0, H, 0, |6|. Often, for time series data, the PCs space changes gradually over time. Updating it on-the-fly (recursively) 

' in the presence of outliers, as more data comes in is referred to as online or recursive robust PCA |]7|, fl8], [0. "Outlier" is a 

loosely defined term that refers to any corruption that is not small compared to the true data vector and that occurs occasionally. 

£NJ ■ As suggested in [10], [5|, an outlier can be nicely modeled as a sparse vector whose nonzero values can have any magnitude. 

A key application where the robust PCA problem occurs is in video analysis where the goal is to separate a slowly changing 

background from moving foreground objects [4], [5 1. If we stack each frame as a column vector, the background is well modeled 

• as lying in a low dimensional subspace that may gradually change over time, while the moving foreground objects constitute 
'_ , 

d ■ the sparse outliers ifTOl . J5] which change in a correlated fashion over time. Other applications include sensor networks based 
detection and tracking of abnormal events such as forest fires or oil spills; or online detection of brain activation patterns from 
functional MRI (fMRI) sequences (the "active" part of the brain can be interpreted as a correlated sparse outlier). Clearly, in 
all these applications, an online solution is desirable. In this work, we focus on this case, i.e. on recursive robust PCA that is 
robust to both independent and correlated sparse outliers. 

The moving objects or the brain active regions or the oil spill region may be "outliers" for the PCA problem, but in most 
cases, these are actually the signals-of-interest whereas the background image is the noise. Also, all the above signals-of- 
interest are sparse vectors that change in a correlated fashion over time. Thus, this problem can also be interpreted as one of 
recursively recovering a time sequence of correlated sparse signals, St, from measurements M t := St + L t that are corrupted 
by (potentially) large magnitude but dense and structured noise, L t . The structure that we require is that L t be dense and lie 
in a low dimensional subspace that is either fixed or changes "slowly enough" in the sense quantified in Sec IIII-BI For the 
fMRI application, one would also like to study the following more general problem: recover St from Al t := ASt + L t where 
L t = AL t and A is a matrix with more columns than rows (fat matrix). Since L t is low-dimensional, so is L t . 
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A. Related Work 

There has been a large amount of work on robust PC A, e.g. |0, |0, |0, 0, ifTTI . fl2l . 1131 . and recursive robust PC A 
e.g. 0, 10, |0. In most of these works, either the locations of the missing/corruped data points are assumed known Q (not 
a practical assumption); or they first detect the corrupted data points and then replace their values using nearby values 0; or 
weight each data point in proportion to its reliability (thus soft-detecting and down-weighting the likely outliers) 0, [0; or 
just remove the entire outlier vector lil2l . ifPTI . Detecting or soft-detecting outliers (St) as in |0, |0, |0 is easy when the 
outlier magnitude is large, but not otherwise. When the signal of interest is St, the most difficult situation is when nonzero 
elements of S t have small magnitude compared to those of L t and in this case, these approaches do not work. 

In a series of recent works |0, 10, a new and elegant solution to robust PCA called Principal Components' Pursuit (PCP) 
has been proposed, that does not require a two step outlier location detection/correction process and also does not throw out 
the entire vector. It redefines batch robust PCA as a problem of separating a low rank matrix, Ct := [L%, . . . , L t ], from a 
sparse matrix, St := [Si,..., St], using the measurement matrix, M. t '■— [Mi, . . . , M t ] = Ct + St- It was shown in Q that 
one can recover Ct and St exactly by solving 

min||£||* + A||<S||i subject to £ + S = M t (1) 
provided that (a) Ct is dense (its left and right singular vectors satisfy certain conditions); (b) any element of the matrix St is 
nonzero w.p. g, and zero w.p. 1 — g, independent of all others (in particular, this means that the support sets of the different 
St's are independent over time); and (c) the rank of Ct and the support size of St are small enough. Here || j4|| * is the nuclear 
norm of A (sum of singular values of A) while \\A\\i is the l\ norm of A seen as a long vector. 

As explained earlier, a key application where the robust PCA problem occurs is in video layering where the foreground 
sequence is sparse while the background sequence is approximately low dimensional. In this case, it is fair to assume that the 
background changes are dense (i.e. Ct is dense). However, the assumption that the foreground support is independent over 
time is not a valid one. Foreground objects typically move in a correlated fashion, and may even not move for a few frames. 
This often results in S t being sparse as well as low rank. 

B. Motivation for Proposed Algorithm 

The question then is, what can we do if Ct is low rank and dense, but St is both sparse and low rank? Clearly in this case, 
without any extra information, in general, it is not possible to separate St and Ct- Suppose that an initial short sequence of 
Lis is available. For example, in the video application, it is often realistic to assume that an initial background-only training 
sequence is available. Can we use this to do anything better? 

One possible solution is as follows. We can compute the matrix containing the left singular vectors of the initial short 
training sequence, Pq. This can be used to modify PCP as follows. We solve 

min||S||i, subject to - P P^(M t - S)\\ F < e, (2) 

where \\.\\f is the Frobenius norm. This then becomes the standard l\ minimization solution for a batch sparse recovery 
problem in noise. As we show later in Lemma [16] denseness of Pq (ensured by denseness of Ct), ensures that the restricted 
isometry constant for (/ — PqPq) is small and hence St can be recovered accurately by solving (0 as long as the "noise" it 
sees is small. Here the "noise" is (/ — pQp^Ct- This is small only if span(Po) approximately contains span(£ f ) (in the sense 
defined in Definition 0, i.e. the subspace spanned by the future background frames is an approximate subset of that of the 
initial training dataset. This is unreasonable to expect in a long sequence. Even though the change of subspace from one time 
instant to the next is usually "slow" (one way to quantify this is given in Sec IIII-BI ). the net change over a long sequence can 
be significant. 

To address this issue, we can replace the above by the following. Using P n , we solve (0 for the next set of a frames and 
use the resulting estimates, S t , to get L t = M t — S t - If the subspace changes during this period, because of the slow subspace 
change assumption (of Sec IIII-BI ). the projection of L t along the newly added directions will be small for the first a frames, 
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thus ensuring that the St's, and hence the Lt's, are accurately estimated for this period. The estimated L t 's can be used in 
a recursive PCA or a projection PCA algorithm to get an updated estimate of span(£ f ) which now includes the span of the 
newly added directions. Using the new subspace basis estimate, P, for solving (O for the next set of a frames, will reduce the 
"noise" seen by it. Thus a more accurate set of St's, and hence Lt's, can be computed for this period. This, in turn, will help 
get a more accurate estimate of span(£ f ). For simplicity, and to get a fully recursive solution, one can replace (O by solving 
the l\ problem at each time separately. This is the key idea of our proposed algorithm which we call Recursive Projected 
Compressive Sensing or ReProCS. 

In an earlier conference paper [1 1, we first introduced the ReProCS idea. It used an algorithm motivated by recursive PCA 
[7 1 for updating the subspace estimates on-the-fly. Recursive PCA is a fast algorithm for solving the PCA problem when data 
comes in sequentially. However, as we explain in Appendix [F] it is difficult to obtain performance guarantees for PCA. In 
this work, we instead use a modification called projection PCA, which can be analyzed more easily. The performance of both 
approaches in simulation experiments is similar. 

In Sec HX-Bl we also explain how ReProCS can be extended to the undersampled problem: recover St from M. t '■= ASt+Ct 
when A is a fat matrix and how our approach for obtaining the performance guarantees also extends. This algorithm was 
introduced in our recent conference paper [0. 

C. Our Contributions 

1) Ours is among the first few works that studies sparse recovery in (potentially) large but structured noise: the noise needs 
to lie in a "slowly changing" low dimensional subspace as defined in Sec IIII-BI Under mild assumptions, we show that, 
w.h.p, ReProCS can exactly recover the support of St at all times; and the reconstruction errors of both St and L t are 
upper bounded by a time invariant and small value at all times. PCP @, ||6) can be thought of as a batch solution to this 
problem, however it requires independent support change as explained earlier. Some other recent works that also studied 
sparse recovery in large but structured noise include [10 j, fl4l . lfT31 . but all of these focus on sparse noise or outliers. 

2) If L t is the signal of interest, then ReProCS is a solution to recursive robust PCA in the presence of sparse and possibly 
correlated outliers. To the best of our knowledge, this is the first rigorous analysis of any recursive (online) robust PCA 
approach and definitely the first to study recursive (online) robust PCA with correlated outliers. Our results directly apply 
to the missing data case as well or equivalently the case where the outlier locations are known (see Sec IIX-AI >. 

3) The proof techniques used in our work are very different from those used to obtain performance guarantees in recent 
batch robust PCA works such as [5 |, [6], ifTTI . iTPJl . ifTZl . The works of [13], [12] also study a different case: that where 
an entire vector is either an outlier or an inlier. Our proof utilizes (a) sparse recovery results [16 |; (b) results from matrix 
perturbation theory that bound the estimation error in computing the eigenvectors of a perturbed Hermitian matrix with 
respect to eigenvectors of the original Hermitian matrix (sin 9 theorem [171), and that bound the perturbed eigenvalues 
(Weyl's theorem |18|) and (c) high probability bounds on eigenvalues of sums of independent random matrices (matrix 
Hoeffding inequality |fl9ll ). 

A key difference of our approach compared with most existing work analyzing finite sample PCA, e.g. [20| and references 
therein, is that it needs to provably work in the presence of perturbation/noise that is correlated with L t . Most existing 
works, including ll20l and the references it discusses, assume that the noise is independent of the data. 
When L t is the signal of interest, the ReProCS approach is related to that of ll2D . 11221 . Il23l in that all of these first try 
to nullify the low dimensional signal by projecting the measurement vector into a subspace perpendicular to that of the low 
dimensional signal, and then solve for the sparse "error" vector (outlier). However, the big difference is that in all of these works 
the basis for the subspace of the low dimensional signal is perfectly known. Our work studies the case where the subspace is 
not known. We have an initial approximate estimate of the subspace, but over time it can change quite significantly. The only 
thing we require is that the changes per unit time are "slow" in a sense quantified in Sec IIII-BI 

In this work, to keep things simple, we use i i minimization done separately for each time instant (also referred to as basis 
pursuit denoising (BPDN)) lfl6l . ||24| . However, this can be replaced by any other sparse recovery algorithm, either recursive 
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or batch, as long as the batch algorithm is applied to a frames at a time (with a selected as explained in Sec |IV| i. Notice that 
ReProCS allows correlated sparse vectors. If something is known about the correlation model, one could replace BPDN by 
modified-CS or support-predicted modified-CS [25 1 . Also, many of the batch CS algorithms from literature could be used. 

D. Paper Organization 

The rest of the paper is organized as follows. We give the notation and background required for the rest of the paper in 
Sec mi The problem definition and the model assumptions are given in Sec [III] We explain the ReProCS algorithm and give 
its performance guarantees (Theorem [UTt in Sec [IV] The proof outline and the terms used in the proof are defined in Sec 
[VI The key lemmas needed to prove Theorem [T8l and their proofs, are given in Sec I VII and Sec IVHI The theorem's proof 
follows easily using these lemmas. This is given in Sec lVIIII A more general subspace change model, ReProCS with deletion, 
and the extension of our results to the missing data and the undersampled measurements' cases are discussed in Sec [IX] In 
Sec IX-A1 we show that our slow subspace change model indeed holds for real videos. In Sec IX-BI we explain how one can 
automatically set parameters for ReproCS in practice. In Sec IX-CI we show numerical experiments demonstrating Theorem 
[T8l as well as comparisons of ReProCS and practical ReProCS with PCP Conclusions and future work are given in Sec |XI] 

II. Notation and Background 

A. Notation 

For a set T C {1, 2, . . . n}, we use \T\ to denote its cardinality, i.e., the number of elements in T. We use T c to denote its 
complement w.r.t. {1, 2, ...n}, i.e. T c := {i G {1, 2, . . . n} : i f T}. 

We use the interval notation, [tj.,t2], to denote the set of all integers between and including <i to i2, i.e. [iijia] : = 
{ti, t\ + 1, . . . t2}- For a vector v, Vi denotes the ith entry of v and vt denotes a vector consisting of the entries of v indexed 
by T. We use \\v\\ p to denote the £ p norm of v. The support of v, supp(t>), is the set of indices at which v is nonzero, 
supp(u) := {i : Vi 0}. We say that v is s-sparse if |supp(w)| < s. 

For a matrix B, B' denotes its transpose, and B^ its pseudo-inverse. For a matrix with linearly independent columns, 
B^ = (B'B)~ 1 B'. We use ||-B||2 ■— max^o H-EMh/IMh to denote the induced 2-norm of the matrix. Also, \\B\\* is the 
nuclear norm (sum of singular values) and ||-B|| roax denotes the maximum over the absolute values of all its entries. We let 
Ui(B) denotes the zth largest singular values of B. For a Hermitian matrix, B, we use the notation B E = D UMJ 1 to denote 
the eigenvalue decomposition of B. Here U is an orthonormal matrix and A is a diagonal matrix with entries arranged in 
non-increasing order. Also, we use Ai(B) to denote the ith largest eigenvalue of a Hermitian matrix B and we use A max (B) and 
Amin(-B) denote its maximum and minimum eigenvalues. If B is Hermitian positive semi-definite (p.s.d.), then Xi(B) = (Ti(B). 
For Hermitian matrices B\ and B2, the notation B\ < B2 means that B2 — B\ is p.s.d. Similarly, B\ >z B2 means that B\ — B2 
is p.s.d. 

For a Hermitian matrix B, \\B\\ 2 = v /max(A^ iax (B), A^ in (B)) and thus, ||S|| 2 < b implies that -b < X min (B) < 
A max (5) < b. 

We use / to denote an identity matrix. For an index set T and a matrix B, B? is the sub-matrix of B containing columns 
with indices in the set T. Notice that Bt '■= BIt- For an m x n matrix B, we use B \ Bt to denote Bt? (here T c := 
{i € {1, 2, . . . n} : i £ T}). Given a matrix B of size m x n and B2 of size m x 77,2, [B B2] constructs a new matrix by 
concatenating matrices B and B2 in a horizontal direction. Thus, [B B2] \ Bt ■— [(£?)t c i? 2 ]- For any matrix B and sets 
Ti,T 2 , {B)xi,T 2 denotes the sub-matrix containing the rows with indices in T\ and columns with indices in T 2 . 

For a tall matrix P, span(P) denotes the subspace spanned by the column vectors of P. 

The notation [.] denotes an empty matrix. 

Definition 1: We refer to a tall matrix P as a basis matrix if it satisfies P'P = I. 

Definition 2: For a basis matrix P and any other matrix B, we say that "span(P) approximately contains span(B)" if 
||(/-PP')S|| 2 /||B||2 is small. 
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Definition 3: The s-restricted isometry constant (RIC) 112 11 . S s , for annxm matrix ^ is the smallest real number satisfying 
(1 - 8 s )\\x\\% < H^T^IIl < (1 + ^)11^111 for a11 sets rc{l,2,...n} with \T\ < s and all real vectors x of length \T\. 
It is easy to see that max T .| X |< s || (W^t) -1 h < T=<tw) ^ 
Definition 4: We give some notation for random variables in this definition. 

1) We let E[Z] denote the expectation of a random variable (r.v.) Z and E[Z|X] denote its conditional expectation given 
another r.v. X. 

2) Let B be a set of values that a r.v. Z can take. We use B e to denote the event Z £ B, i.e. B e := {Z £ B}. 

3) The probability of any event B e can be expressed as [26 1, 

P(B e ) :=E[I B (Z)}. 

where 

f 1 if Z £ B 
I otherwise 

is the indicator function on the set B. 

4) For two events B e , B e , P{B e \B e ) refers to the conditional probability of B e given B e , i.e. P(i3 e |i3 e ) 

5) For a r.v. X, and a set £> of values that the r.v. Z can take, the notation P(B e \X) is defined as 

P(B e \X) :=E[I B (Z)\X}. 

Notice that P(B e \X) is a r.v. (it is a function of the r.v. X) that always lies between zero and one. 

Finally, RHS refers to the right hand side of an equation or inequality; w.p. means "with probability"; and w.h.p. means 
"with high probability". Also we use a < b to indicate (in a non-rigorous sense) that the dominant term in the upper bound 
on a is b. 



:=P{B e ,B e )/P{B e ). 



B. Compressive Sensing result 

The error bound for noisy compressive sensing (CS) based on the RIC is as follows [16|. 
Theorem 5 (/[Wp: Suppose we observe 

y := f i + z 

where z is the noise. Let x be the solution to following problem 

min||x||i subject to \\y - *x|| 2 < £ (3) 
Assume that x is s-sparse, \\z\\2 < and (^("J/) < b(y/2 — 1) with a < b < 1. Then the solution of ([3]) obeys 

\\x-x\\ 2 <Ci£ 

1 1-(V2+1)&»(*) - l-(\/2+l)fc(V2-l)- 

Remark 6: Notice that if b is small enough, C\ is a small constant but C\ > 1. For example, if &s(^) < 0.15, then C\ < 7. 
If Ci£ > \\x\\2, the normalized reconstruction error bound would be greater than 1, making the result useless. Hence, © gives 
a small reconstruction error bound only for the small noise case, i.e., the case where ||z||2 < £ <C ||^||2- In fact this is true 
for most existing literature on CS and sparse recovery, with the exception of [10|, 04], lfl5l (focus on large but sparse noise) 
and 0, (6). 
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C. Results from linear algebra 

Kahan and Davis's sin6> theorem [17| studies the rotation of eigenvectors by perturbation. 
Theorem 7 fsin6* theorem fJTj): Given two Hermitian matrices A and % satisfying 



A 
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' E' ' 


Aj_ 
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EE, 



' H B' ' 




' E' ' 


B H±_ 







where [E E±] is an orthonormal matrix. The two ways of representing A + W are 

~A + H B' 



A + n = 



EE, 
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"a " 




" F' ' 








FF ± 








A'. 
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B A ± +H ± _ 

where [F Fj_] is another orthonormal matrix. Let TZ ;= (A + %)E - AE = HE. If X min (A) > A max (AjJ, then 

- FF')E\\ 2 < - J n [ 2 

The above result bounds the amount by which the two subspaces span(E') and span(P) differ as a function of the norm of 
the perturbation ||7£||2 and of the gap between the minimum eigenvalue of A and the maximum eigenvalue of A^. Next, we 
state Weyl's theorem which bounds the eigenvalues of a perturbed Hermitian matrix, followed by Ostrowski's theorem. 
Theorem 8 (Weyl H18V ): Let A and H be two n x n Hermitian matrices. For each i — 1,2, ... ,n we have 

Xi(A) + A min (H) <\i{A + %)< HA) + A max (H) 

Theorem 9 (Ostrowski iUSl/): Let H and W be n x n matrices, with H Hermitian and W nonsingular. For each i = 1, 2 . . . n, 
there exists a positive real number 0j such that \ m i n {WW') <0i< A max (t¥W / ') and Xi(WHW) = 9i\i{H). Therefore, 

X min (WHW) > \ min (WW')\ min (H) 

The following lemma proves some simple linear algebra facts. 

Lemma 10: Suppose that P, P and Q are three basis matrices. Also, P and P are of the same size, Q'P — and 
||(/-PP')-P||2 =£*. Then, 

1) ||(7 -PP')PP' || 2 = ||(I-PP')AP'||2 = ||(/-PP')P||2 = ||(/-PP')P||2 = C* 

2) ||PP' - PP'|| 2 < 2||(J- PP')P\\ 2 = K* 

3) ||P'Q|| 2 <C* 



4) VW? < - PP')Q) < 1 
The proof is in the Appendix. 



D. High probability tail bounds for sums of independent random matrices 

The following lemma follows easily using Definition [4] We will use this at various places in the paper. 

Lemma 11: Suppose that B is the set of values that the r.v.s X, Y can take. Suppose that C is a set of values that the r.v. 
X can take. For a < p < 1, if V{B e \X) > p for all X eC, then P(B e \C e ) >pas long as P(C e ) > 0. 
The proof is in the Appendix. 

Next, we state the matrix Hoeffding inequality [19, Theorem 1.3] which gives tail bounds for sums of independent random 
matrices. 

Theorem 12 (Matrix Hoeffding for a zero mean Hermitian matrix /|79l/): Consider a finite sequence {Z t } of independent, 
random, Hermitian matrices of size nx n, and let {A t } be a sequence of fixed Hermitian matrices. Assume that each random 
matrix satisfies (i) P(Zf < A\) = 1 and (ii) E(Z t ) = 0. Then, for all e > 0, 

P(A max (^^t) < e) > l-nexp(- — ), where a 2 = 

The following two corollaries of Theorem Q~2] are easy to prove. The proofs are given in the Appendix. 
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t t t t 2 tj tj 

I I I — I I — J 

» , ' l , ' 1 » ' 

P (t ) = Po p ( o = Pi = IP« Pi,™ J p co = p j = W-i p ;-n«wl 



Fig. 1. The subspace basis change model explained in Sec IIII-Al 



Corollary 13 (Matrix Hoeffding conditioned on another random variable for a nonzero mean Hermitian matrix): 
Consider an a-length sequence {Z t } of random Hermitian matrices of size n x n that are conditionally independent given a 
random variable X. Assume also that, for all X e C, (i) P(6i7 < Z t < b 2 I\X) = 1 and (ii) b 3 I < ± ^E^X) ^ 
Then for all e > 0, 

1 2 
P(A max (- V Z t ) <b 4 + e\X) > 1 - nexp(~ - ) for all X e C 

a 8(62 - o\Y 

P(A min (- V Z t ) >b 3 - e\X) > 1 -nexp(- a£ ) for all X e C 
a 8(o 2 - Oi) 

The proof is in the Appendix. 

Corollary 14 (Matrix Hoeffding conditioned on another random variable for an arbitrary nonzero mean matrix): 
Consider an a-length sequence {Z t } of random n\ x n 2 matrices that are conditionally independent given a random 
variable X. Assume also that, for all X G C, (i) P(||Z t || 2 < b x \X) = 1 and (ii) \\±J2t ~&(Z t \X)\\ 2 < b 2 . Then, for all e > 0, 

P(|| - Y^Zth < b 2 + e\X) > 1 - (m + n 2 ) exp(-^) for all X e C 
(y, oZiO^ 

The proof is in the Appendix. 

III. Problem Definition and Model Assumptions 
We give the problem definition below followed by the model and then describe the two key assumptions. 



A. Problem Definition 

The measurement vector at time t, M t , is an n dimensional vector which can be decomposed as 

M t =L t + St (4) 

Here St is a sparse vector with support set size at most s and minimum magnitude of nonzero values at least S'min. L t is a 
dense but low dimensional vector that satisfies the model given below. We are given an accurate estimate of the subspace in 
which the initial i tra j n L t 's lie, i.e. we are given a basis matrix Po so that ||(J — PoPo^tmiJb/ll^miJh i s small. The goal is 

1) to estimate both St and L t at each time t > itrain, and 

2) to estimate span(£ t ) every so often, i.e. compute so that — P(t)P( t ))>Ct||2/||>Ct||2 is small. 

Notation for 5 t . We do not assume anything about St except sparsity. Let T t := {i : (St)i ^ 0} denote the support of 
St- Define 

S'min := minmin |(Sf),-|, and s :— max|T t | 

t l£T t t 

Model on L t . {L t } is a sequence of dense vectors satisfying the following model. 

1) L t lies in a low dimensional subspace that changes every-so-often. Let tj denote the change times. Then the following 
holds. 



a) L t — P(t) a t with Pu\ — Pj for all tj <t < j = 0, 1,2- • • J, i.e. there is a maximum of J subspace change 
times. We can define tj+i = oo. 

b) Pj is an n x basis matrix with rj <C n and 77 <C (tj+i — ij). 

c) At the change times, tj, Pj changes as Pj = [Pj-i Pj,new] where Pj,„ e w is a nXCj jBev/ basis matrix with Pj new Pj^i = 
0. Thus rj = Tj-i + c j new . This model is illustrated in Fig [T] 

d) There exists a constant c mx such that < Cj ine „ < c mx for all j. 

2) The projection vector, at := Pn)'Lt, is a random variable (r.v.) with the following properties. 

a) at is mutually independent over t. 

b) It is a zero mean bounded r.v., i.e. E(a t ) = and there exists a constant 7* s.t. ||at||oo < 7* f° r t. 

c) Its covariance matrix At := Cov[aj] = E(a t a' t ) is diagonal with A~ := mirij A m i n (A t ) > and A + := 
max t A max (A t ) < 00. Thus the condition number of any A t is bounded by / 

d) For tj <t < tj+i, a t = P'jLt is an rj length vector which can be split as 



A- ' 



at,* 

^£,new 

where a t * := Pj_x'L t is an Tj_x length vector of projections along the existing directions and a t , new : = -P/new'-^t 
is a c j new length vector of projections along the new ones. Thus, for this interval, L t can be rewritten as 

Lt Pj — \Ort,* ~t~ Pj,newQ<t,new 

(A t ), 



Also, A t can be split as A t = v where (A t )* = Cov[a t »] and (A t ) new = Cov[a tnew ] are diagonal 

(A t ) new J 

matrices. 

3) Pj and a t change slowly in the sense quantified below in Sec IIII-BI 

4) The Lt's, and hence their subspace basis matrices Pj, are dense as quantified in Sec IIII-Cl 
We discuss the above model assumptions after stating our main result in Sec. IIV-BI 

In the above model, the rank of Pj will keep increasing whenever Cj, ne w > 0. In other words, the dimension of the subspace 
in which the current L t will keep increasing. However, in practical applications, this may not happen. Some directions may 
also get removed from Pj so that the subspace dimension remains roughly constant with time. This can be modeled as 
Pj = [Pj-i -fj.new] \ Pj.o\d- As we show in Corollary |42] in Sec IIX-CI even with this more general model, our proposed 
algorithm directly applies and its performance guarantees also change only slightly. 

We do not use this more general model here to keep the notation simple. Recall that our goal is to estimate span(£t). With 
our current model (no removals), span(P( t )) = span(£ t ) and thus our goal can be equivalently stated as one of computing a 
basis matrix P^ t ) so that span(P( t )) w span(P( t )). 

B. Slow subspace change 

By slow subspace change we mean all of the following. 

First, the delay between consecutive subspace change times, tj+i — tj, is large enough. 

Second, the projection of Lt along the newly added directions, at t aew, is initially small, i.e. xnaxtj<t<u+a ||at,new||oo < 
7new, with 7„ ew <C 7* and 7 new <C S m i u , but can increase gradually. We model this as follows. Split the interval [tj,tj+i — 1] 
into a length periods. We assume that 

max max ||a M ew||oo < 7new,fc := min(u fc_1 7 new , 7*) 

j te[tj+(k—l)a,tj+ka—l] 

for an>l but not too largeQ. This assumption is verified for real video data in Sec. IX-AI 

Third, the number of newly added directions is small, i.e. Cj. new < c mx -C tq. This is also verified in Sec. IX-AI 

'Small 7nc W and slowly increasing 7 new ,fc is needed for the noise seen by the sparse recovery step to be small. However, if 7new is zero or very small, it 
will be impossible to estimate the new subspace. This will not happen in our model because 7 nl;w > A~ > 0. 
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C. Measuring denseness of a matrix and its relation with RIC 

For a tall n x r matrix, B, or for a n x 1 vector, B, we define the the denseness coefficient as follows: 

K S (B) := max — . 
m<« ||P|| 2 

where ||.||2 is the matrix or vector 2-norm respectively. Clearly, k s (B) < 1. First consider an n-length vector B. Then n s 
measures the denseness (non-compressibility) of B. A small value indicates that the entries in B are spread out, i.e. it is a 
dense vector. A large value indicates that it is compressible (approximately or exactly sparse). The worst case (largest possible 
value) is K S {B) = 1 which indicates that B is an s-sparse vector. The best case is k s (B) = yjs/n and this will occur if each 
entry of B has the same magnitude. Similarly, for annxr matrix B, a small n s means that most (or all) of its columns are 
dense vectors. 

Remark 15: The following facts should be noted about K s (.). 

1) For annxr matrix B, K S (B) is a non-decreasing function of s. 

2) For annxr basis matrix B, k s (B) is a non-decreasing function of r = rank(P). 

3) A loose bound on n s (B) obtained using triangle inequality is n s (B) < ski(B). 

4) For a basis matrix P, \\P\\2 = 1 and hence K S (P) = max| T |< s ||IyP||2 and k s (PP') — n s (P). Thus, for any other basis 
matrix Q for which span(Q) = span(P), k s (P) = K S (Q). Thus, k s (P) is a property of span(P), which is the subspace 
spanned by the columns of P, and not of the actual entries of P. 

The lemma below relates the denseness coefficient of a basis matrix P to the RIC of I—PP'. The proof is in the Appendix. 

Lemma 16: For an n x r basis matrix P (i.e P satisfying P'P = I), 

5.(1 - PP 1 ) = k 2 s (P). 

In other words, if P is dense enough (small k s ), then the RIC of I — PP' is small. Thus, using Theorem [5] all s-sparse 
vectors, St can be accurately recovered from y t := (I — PP')St + fit if Pt is small noise. 

In this work, we assume an upper bound on K s (Pj-\), and a tighter upper bound on K s (Pj new ). Additionally, we also assume 
denseness of another matrix, Pj in ew,fc, whose columns span the currently unestimated part of span(P, new ) (see Theorem [T8ll. 

As we explain in Sec HV-Dl the denseness coefficient k s (B) is related to the denseness assumption required by PCP 0. 

IV. Recursive Projected CS (ReProCS) and its Performance Guarantees 

In Sec UV-Al we explain the ReProCS algorithm and why it works. We give its performance guarantees in Sec. II V-B I and 
discuss the assumptions used by our result in Sec. IIV-CI A qualitative discussion w.r.t. the result for PCP is given in Sec IIV-DI 
Practical parameter setting for ReProCS is discussed later in Sec. IX-BI 

We summarize the Recursive Projected CS (ReProCS) algorithm in Algorithm Q] It uses the following definition. 

Definition 17: Define the time interval Xj.fc := [tj + (k — l)a,tj+ka — 1] for k = 1, . . .K andXj.A- + i := [tj+Ka, ij+i — 1]. 
Here, K is the algorithm parameter in Algorithm [T] 

A. Recursive Projected CS (ReProCS) 

The key idea of ReProCS is as follows. Assume that the current basis matrix Pm has been accurately predicted using past 
estimates of L t , i.e. we have P(t-i) with ||(7 — P(t-i)PLxy)P(i)\U sma ll- We project the measurement vector, M t , into the 
space perpendicular to P(t-i) to get the projected measurement vector y t := $( t )Mt where <&( t ) = I— P(t-x)P'(+-x\ (step la). 
Since the n x n projection matrix, §>n) has rank n — where = rank(P(t_i)), therefore y t has only n — "effective" 
measurement^, even though its length is n. Notice that y t can be rewritten as y t — &ft)St + fit where /3 t := $>u\Lt- Since 
— P(t-i) P! t _iy)P(t) II 2 is small, the projection nullifies most of the contribution of L t and so the projected noise f3 t is 



i.e. some r„ entries of yt are linear combinations of the other n — r* entries 
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Algorithm 1 Recursive Projected CS (ReProCS) 

Parameters: algorithm parameters: £, u>, a, K, model parameters: tj, r*o, Cj-, new 
(set as in Theorem [18] or as in Sec IX-BI when the model is not known) 

Input: Mt, Output: St, Lt, Pu) 

Initialization: Given training sequence [Li,L%, ■ ■ • , £t lrai J, estimate Po by computing an EVD as ^— X)t=i L t L t ' E XP EKE' 
and then retaining the eigenvectors with the tq largest eigenvalues, i.e., Pq <— (E){i,2,— ,r }- 
Let P( t ) 4— Pq. Let j 4— 1, k <— 1. For £ > i tra i n , do the following: 

1) Estimate T t and St via Projected CS: 

a) Nullify most of L t : compute $( t ) •<— 7 — P(t-i)PLiy compute y t <J>( t )M t 

b) Sparse Recovery: compute St.cs as the solution of min x ||x||i s.t. \\yt — <&mx\\2 < £ 

c) Support Estimate: compute T t — {i : \(St,cs)i\ > ^} 

d) LS Estimate of St: compute (§ t )f t = ((®t)f t Vyt, (St) fc = 

2) Estimate L t : L t = M t - 

3) Update P (t) by Projection PCA 

a) If t = tj + ka - 1, 

i) compute i£ teIj fe (I - P^P^Um - Pj-iP'^) B ^ 

where A& is of 

ii) set P( t ) [Pj-i -R/,new,fc]; increment fc •<— fc + 1. 
Else 

i) set P [t) <- P(t-i)- 

b) If t — tj + Ka — 1, then set Pj «— [Pj-i Pj.new,K\- Increment j <— j + 1. Reset k <— 1. 

4) Increment £ ■< — t + 1 and go to step 1 . 



Pa. 



A fe 
A fe , 



P' 

j.new.k 

pi 

7,new,fc,-L 



small. Recovering the n dimensional sparse vector St from y t now becomes a traditional sparse recovery or CS problem in 
small noise l24ll . 11271 . 11281 . We use t\ minimization to recover it (step lb). If the current basis matrix P(t), and hence its 
estimate, P( t _x), is dense enough, then, by Lemma [T6l the RIC of $>u) is small enough. Using Theorem |5] this ensures that 
St can be accurately recovered from y t . 

By thresholding on the recovered St, one gets an estimate of its support (step lc). By computing a least squares (LS) 
estimate of St on the estimated support and setting it to zero everywhere else (step Id), we can get a more accurate final 
estimate, St, as first suggested in |29l . This St is used to estimate L t as L t = M t — St- As we explain in the proof of Lemma 
l30l if the support estimation threshold, uj, is chosen appropriately, we can get exact support recovery, i.e. T t =T t . In this case, 
the error et := St — St = Lt — Lt has the following simple expression: 

e t = I Tt (®(t))Tjl3t = /T t [($( t ))T t (*(t))r t ]~ 1 ^T t '$( t )^ (5) 

The second equality follows because (^(t))r ^(t) = (^(t)7r) 3?(t) = 7r'^(t) for any set T. Consider ate X,.i. At this time, 

L t satisfies L t = Pj-\Ot,* + P/,newat,new, P(t) = Pj = [Pj-l,Pj,new], P(t-1) = Pj-1 and SO $ (t) = $ i>0 := I - Pj-iP'j_ x . 
Let <&j. k := 7 - P J _ 1 Pj_ 1 - Pi,new,fcPj, n ew,Jfc ( with ^J.new.O = [•]), Cj,k ■= || *j,fcPj,new|| 2, «a,fc ■= rOSXj K s ($j,fcPj,new), 

0fc := maxj max| T |< s || [("J^aOt (*3?j,fc)T] ~ 1 1 1 2, ?** := ^0 + (j — ^-)c mx , and c := c mx . We assume that the delay between change 
times is large enough so that by t = tj, P(t-i) = Pj-i is an accurate enough estimate of Pj-i, i.e. ||$j.o-Pj-i||2 < ?"*C 
for a C small enough. Using \\I Tt ' <& 3 flPj-ih < 1 1 ^,0^— 1 1 1 2 < r*C, ||7r t '$j,oPnew||2 < «*,o || ^i.o-fj.newll 2 and Q <0 = 
||*i,oPiew||2 < 1, we get that ||e t || 2 < fouCy/ui* + ^o«*,o\/c7new The denseness assumption on Pj-_x; ||$j, -Pi-i ||a < 
and 0o < 1/(1 — 5 s ($j t o)) ensure that c/) is only slightly more than one (see Lemma |28l. If ^/C < 1/7*' tne r ' rst term m tne 
bound on ||et|| 2 is of the order of -\/C and hence negligible. The denseness assumption on "I^ oPj.new, whose columns span 
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First Projection PCA Second Projection PCA Update Pj when the last Projection PCA Is done 

1 y ' 

K times Projection PCA 



Fig. 2. The projection PCA algorithm. 



the currently unestimated part of span(Pj new ), ensures that k s ,o is significantly less than one. As a result, (/>ok S} o < 1 and 
so the error || || 2 is of the order of y^new- Since 7 new <C S m \ n and c is assumed to be small, thus, \\e t \\2 — \\St — St\\2 is 
small compared with 1 1 <St 1 1 2 , i.e. St is recovered accurately. With each projection PCA step, as we explain below, the error et 
becomes even smaller. 

Since L t — M t — St (step 2), et also satisfies e t — L t — L t . Thus, a small et means that L t is also recovered accurately. The 
estimated L t 's are used to obtain new estimates of -P^new every a frames for a total of Ka frames via a modification of the 
standard PCA procedure, which we call projection PCA (step 3). We illustrate the projection PCA algorithm in Fig |2] In the 
first projection PCA step, we get the first estimate of Pj >Dev/ , Pj.mw,i- For the next a frame interval, P(t-i) = [Pj -li -Pi,new,i] 
and so &u) = Using this in the projected CS step reduces the projection noise, /3j, and hence the reconstruction error, e*, 
for this interval, as long as 7 ne w,fc increases slowly enough^. Smaller et makes the perturbation seen by the second projection 
PCA step even smaller, thus resulting in an improved second estimate Pj <n ev/,2- Within K updates (K chosen as given in 
Theorem [T8li. under mild assumptions, it can be shown that both 1 1 1 1 2 an d me subspace error drop down to a constant times 
At this time, we update Pj as Pj = [Pj-i, -Pj,new,if]. We should mention that the idea of projecting perpendicular to a 
partly estimated subspace before PCA has been used in other contexts in past work [30 1, |fT3l . 

The reason we need the projection PCA algorithm as given in step 3 of Algorithm^is because the error et — Lt — Lt — St — St 
is correlated with L t . We explain this point in detail in Appendix [F] We also discuss there some other alternatives. 

B. Performance Guarantees 

We state the main result here first and then discuss it in the next two subsections. The proof outline is given in Sec [V] and 
the actual proof is given in the subsequent sections. 

Theorem 18: Consider Algorithm Q] Let c := c mx and r :— ro + ( J — l)c. Assume that L t obeys the model given in 

Sec. IIII-AI and there are a total of J change times. Assume also that the initial subspace estimate is accurate enough, i.e. 

||(J - P F%)P \\ < roC. for a C that satisfies 

. 10- 4 1.5 x IP' 4 1 A+ 
C < min(— 5-, — , -5-^) where / := — 

If the following conditions hold: 

1) the algorithm parameters are set as £ = £o(C)> < w < S m in — 7p£, K = K((), a > a a dd(CX where 

£o(C)j P, K((), a a dd(C) are defined in Definition [T9l 

3 For t S Ij,fe, ||et||2 < 4>h— l r *C\A^T* H" ^k— l^s,k— lCj,k— lv^Tnew.fc. By Lemma [28l <f>k—i is only a little more than one; by assumption, K Sl k—l is 
significantly less than one; and as we explain in Sec IV-CI £j &_i can be shown to decrease exponentially with k. If 7new,fc increases at a slower rate than 
the rate of decrease of Cj fe— l> then the second term in the bound on | [ | [ 2 also decreases expoentially with k. 
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2) Pj—li Pj.newi Dj,new,k • (-^ Pj—lPj—l Pj z new,kPj new k)Pj,new and Qj,new,k • {J- Pj.newPj.new )-^j,new,fc have dense 

enough columns, i.e. 

K2s{Pj-l) < 0.3, maXK 2s (-Pj,new) < 0.15, 

j 

max max k,2s(Di new k) < 0.15, max max K2s(Qi new k) < 0.15 

j 0<fc<_ff j 0</c<if J ' 

with Fj !n ew,o = [•] (empty matrix). 

3) for a given value of S m i n , the subspace change is slow enough, i.e. 

max(tj + i — tj) > Ka 1 

3 

max max ||a tine w||oo < 7new,fc := mm(1.2 fc_1 7 ne w, 7*): for all k = 1, 2, . . . K, 

j t j +(k-l)a<t<t j +ka 

14^o(C) < S min , 

4) the condition number of the covariance matrix of a t new averaged over t 6 2j,fc, is bounded, i.e. 

9j,k < V2 

where gj^ is defined in Definition [T9l 
then, with probability at least (1 — n~ 10 ), at all times, t, all of the following hold: 

1) at all times, t, 

f t = T t and 

||et||a = \\L t - Lth = lift - S t \\ 2 < 0.18^7new + 1.2^(^+0.06^)- 

2) the subspace error SE( t ) := — P^ t )P^)P(t)\\2 satisfies 



SE (t) < 



< 



(r + (j -iWC + OAcC + otf- 1 if teij, k , k = i,2...K 

(ro+jc)C if t€lj tK +i 

'lo-VC + o.e*- 1 if teXj,k, k = i,2...K 

10- 2 VC if t£Xj, K+ i 



3) the error e t = St — St = L t — L t satisfies the following at various times 

l|et|| 2 < 



0.18^0.72 fe - 1 7new + 1.2(Vf+0.06^)(r + (j-l)c)C7* if teX j>k , k = l,2...K 
1.2(r +jc)CVrj* if t€lj, K +i 



< 



0.18 v ^0.72 fe - 1 7new + 1.2(VF+0.06V5)v^ if te%, k = l,2...K 



1.2y/fy/Z if teX jtK +l 

This result says the following. Consider Algorithm Q] Assume that the initial subspace error is small enough. If (a) the 
algorithm parameters are set appropriately; (b) the matrices defining the previous subspace, the newly added subspace, and the 
currently unestimated part of the newly added subspace are dense enough; (c) the subspace change is slow enough; and (d) the 
condition number of the average covariance matrix of a f new is small enough, then, w.h.p., we will get exact support recovery 
at all times. Moreover, the sparse recovery error will always be bounded by 0.18y / c7new plus a constant times \f(,. Since £ is 
very small, 7„ ew <C S m [ n , and c is also small, the normalized reconstruction error for recovering St will be small at all times. 

In the second conclusion, we bound the subspace estimation error, SE( t ). When a subspace change occurs, this error is 
initially bounded by one. The above result shows that, w.h.p., with each projection PCA step, this error decays exponentially 
and falls below 0.01-y/C within K projection PCA steps. The third conclusion shows that, with each projection PCA step, 
w.h.p., the sparse recovery error as well as the error in recovering L t also decay in a similar fashion. 



13 



C. Discussion 

First consider the choices of a and of K. Notice that K = K(() is larger if £ is smaller. Also, a a dd is inversely proportional 
to (. Thus, if we want to achieve a smaller lowest error level, £, we need to compute projection PCA over larger durations a 
and we need more number of projection PCA steps K. 

Now consider the assumptions made on the model. We assume slow subspace change, i.e. the delay between change times 
is large enough, ||at, n ew||oo is initially below 7 new and increases gradually, and 14p£o < 5 m i n which holds if c mx and 7 new are 
small enough. Small c mx , small initial a^new (i-e. small 7 ne w) and its gradual increase are verified for real video data in Sec. 
IX-AI As explained there, one cannot estimate the delay between change times with just one video sequence of a particular 
type (need an ensemble) and hence the first assumption cannot be verified. 

We also assume that condition number of the average covariance matrix of a t . new , is not too large. This is an assumption 
made for simplicity. It can be removed if the newly added eigenvalues can be separated into clusters so that the condition 
number of each cluster is small (even though the overall condition number is large). This latter assumption is usually true 
for real data. Under this assumption, we can use the cluster projection PCA approach described in ||3T| for ReProCS with 
deletion. The idea is to use projection PCA to first only recover the eigenvectors corresponding to the cluster with the largest 
eigenvalues; then project perpendicular to these and Pj-% to recover the eigenvectors for the next cluster and so on. 

Other than these, we assume the independence of ctt's over time. This is done so that we can use the matrix Hoeffding 
inequality |fl9l Theorem 1.3] to obtain high probability bounds on the terms in the subspace error bound. In simulations, and 
in experiments with real data, we are able to also deal with correlated at's. In future work, it should be possible to replace 
independence by a milder assumption, e.g. a random walk model on the at's. In that case, at tj + ka— 1, one would compute the 
eigenvectors of (1/a) Yltez- k ^jfii^t ~ Lt-i)(Lt — Lt-i)'^ o- Moreover, one may need to use the matrix Azuma inequality 
|[T9l Theorem 7.1] instead of Hoeffding to bound the terms in the subspace error bound. 

Finally, we assume denseness of Pj-i and Pj >mv/ as well as of -Dj.new.fc and Qj. ne w,k in condition 2. As we explain in Sec 
IIV-DI denseness of Pj-i and P/new is a subset of the assumptions made in earlier works [[5] . It is valid for the video application 
because typically the changes of the background sequence are global, e.g. due to illumination variation affecting the entire 
image or due to textural changes such as water motion or tree leaves' motion etc. Thus, most columns of the matrix £ t are 
dense and consequently the same is true for any basis matrix for span(£ t ). Now consider denseness of -Dj, n ew.fc whose columns 
span the currently unestimated part of the newly added subspace. Our proof actually only needs ||iT t '-Dj,new,fc||2/||-Dj,new,fe||2 
to be small at every projection PCA time, t = tj + ka — 1. We attempted to verify this in simulations done with a dense Pj 
and -Pj.new- Except for the case of exactly constant support of St, in all other cases (including the case of very gradual support 
change, e.g. the models considered in Sec IX-Cb . this ratio was small for most projection PCA times. We also saw that even 
if at a few projection PCA times, this ratio was close to one, that just meant that, at those times, the subspace error remained 
roughly equal to that at the previous time. As a result, a larger K was required for the subspace error to become small enough. 
It did not mean that the algorithm became unstable. It should be possible to use a similar idea to modify our result as well. 
An analogous discussion applies also to Qj. ne w,fc- In fact denseness of Qj. n ew,fc is not essential, it is possible to prove a slightly 
more complicated version of Theorem [T8l without assuming denseness of Qj. ne w,k- 

D. Discussion w.r.t. the PCP result 

First of all, as mentioned earlier, we solve a recursive version of the robust PCA problem where as PCP in [5 1 solves a batch 
one. Also, the proof techniques used are very different. Hence it is impossible to do a direct comparison of the two results. 

The first key difference between our result and that of PCP |]5] is as follows. The result of [5] assumes that any element of 
the n x t matrix St is nonzero w.p. g, and zero w.p. 1 — g, independent of all others (in particular, this means that the support 
sets of the different Sj's are independent over time). This assumption ensures that w.h.p. S t is sparse but full rank and hence 
ensures that it can be separated from C t which is low rank but dense. As explained earlier, the assumption of independent 
support sets of St is not valid for real video data where the foreground objects usually move in a highly correlated fashion 
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over time. On the other hand, our result for ReProCS does not put any such assumption on the support sets of the St's. In 
simulations, we show examples where the support is generated in a highly correlated fashion thus resulting in a low rank and 
sparse St and ReProCS is still able to recover St accurately. The reason it can do this is because it assumes accurate knowledge 
of the subspace spanned by the first few columns of Ct and it assumes slow subspace change. However, ReProCS does need 
denseness of -Dj^ew.fc and in simulations, we observe that this reduces when the support of St changes very infrequently. 

Next let us compare the denseness assumptions. Let Ct = ITEV be its SVD. Then, for t G [tj,tj+i —1],U= [Pj-i, Pj,new] 
and the a*'s are the various columns of the matrix T,V. Thus V — [<xi, a-x . . . at]'S _1 . PCP |5| assumes denseness of U and of 
V: it requires Ki(U) < \J fir/n and Ki(V) < \J fir/n for a constant p, > 1. Moreover, it also requires ||t/V|| ma x < \[Wl n - 
Here ||B|| ma x := maxjj |(-B)-y |. This last assumption is a particularly strong one. On the other hand, our denseness assumptions 
are on Pj-i and Pj. ne w which are sub-matrices of U, and on Dj, n ew,)fc whose columns span the currently unestimated part of 
span(Pj.new)- We do not need denseness of V and we do need a bound on ||fV||max- However, an additional assumption that 
we do need is the independence of at's over time. We have discussed above in Sec HV-Cl how this can possibly be relaxed. 

V. Definitions and Proof Outline for Theorem[T"81 

In Sec IV-AI we define the various quantities that will be used in the lemmas leading to the proof of Theorem [18] Next, we 
give the proof outline in Sec. IV-BI The reason for exponential decay of the subspace error with every projection PCA step is 
briefly explained in Sec IV-CI The lemmas leading to the proof of the theorem, and their proofs, are given in the subsequent 
sections - Sec |VT] and Sec IVIII The theorem's proof follows easily using these lemmas. This is given in Sec IVIIII 

A. Definitions 

A few quantities are already defined in the model (Sec IIII-Al i. in Algorithm Q] and in Theorem [18] Here we define more 
quantities needed for the proofs. 

Definition 19: We define here the parameters used in Theorem [T8l 

1) Define K{Q := [^gr 1 " 

2) Define £o(C) := V^ 7 „ew + V((Vr + y/c) 

3) Define p := maxt{Ki(St jCS — St)}. Notice that p < 1. 

4) Let K = K(Q. We define a a( jd(C) as the smallest value of a so that (pjf(a, C)) KJ > 1 — n~ 10 , where Pif(a, C) is 
defined in Lemma [35] We can compute an explicit value for a a dd by using the fact that for any x < 1 and r > 1, 
(1 — x) r > 1 — rx. This gives us 

8 24^ 16 
«add = [(log6AV + 11 logn) ^ 2(A _ )2 max(mm(1.2 4g 7n 4 ew , 7 4 ), — , 4(0.186 7n 2 ew + 0.0034 7new + 2.3) 2 )1 

In words, a a dd is the smallest value of the number of data points, a, needed for one projection PCA step to ensure that 
Theorem [T8l holds w.p. at least (1 — n~ w ). 

5) Define the condition number of Cov(at, ne w) averaged over t £ Xj.k as 

\ + 

A 7, new. k . 

g 3 ,k ■= y — - where 



A 



7,new,fe • — A max ( y (A^) new ), Xj new ,k ■ A m i n ( } (A^ 



Notice that A < Aj. new .fc < Aj !n ew.fe + < A + and thus g.j t k < / = A+/A . Recall that At = Covfat] = E(atat'), 
(A t ) new = E(at,newat inew ), A - = min t A m i n (A t ) and A" = max t A max (A t ). 
Definition 20: We define the noise seen by the sparse recovery step at time t as 

fit ■.= \\{I-P { t-x ) P[ t _ 1) )L t \\ 2 . 

Also the reconstruction error of St is 
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Here St is the final estimate of St after the LS step. Notice that e* also satisfies et — L t — L t . 

Definition 21: We define the subspace estimation errors as follows. Recall that Pj ine w,o = [■] (empty matrix). 

SE (t) :=||(/-P (t) P ( ' t) )P (t) || 2 , 
0,* ■.= \\{I-P j -iP' j _ l )P j -ih 

Cj,k '■= — Pj-lPj-l — Pj,new,kPj. nev/ ^k)Pj,new\\2 

Remark 22: Recall from the model given in Sec IIII-AI and from Algorithm Q] that 

1) ^j.new.fc is orthogonal to P j - 1 , i.e. P,' new>fe P/-i = 

2) Pj_ x := [P ,i\ 

.new, K, ■ ■ -Pj-1 . new . k] and Pj-i := [Po,Pi ,new: ■ • ■ 1 j — l,newj 

3) for t £ Zj,k+i, P(t) = [Pj-i,Pj,new,fc] and P( t ) = P, = [Pj-1, P/.new]- 

4) d> (t) ^I-P^i) 

From Definition [21] and the above, it is easy to see that 

1) (j> < Ci,* + Z)j-'=i 0',k 

2) SE( t ) < Cj> + Cj,* < &,* + X^'=i Cj'.ir + Cj,k f° r * £ ^j,fe+i' 

Definition 23: Define the following 

1) fe , $ i)0 and fc 

a) := I — P / _iPj_ 1 - Pj,B£w,kPj tneWtk is the CS matrix for * € X,,fe+i, i.e. $( t) = for this duration. 

b) $j,o := I — Pj-iPj_i is the CS matrix for t 6 i-e- &(t) — $j,o f° r tr, is duration. is also the matrix used 
for projection PCA for t 6 [tj,tj+i — 1]. 

c) cj) k := maxjmax T: | T |< s 1 1 ( (3>j, fe )-r ' (3>j, fe )r) _ 1 1 1 2- It is easy to see that fc < 1 _ maXj 1 Sa ^. k y 

2) Dj,nevi,k, -Dj'.new and Dj,* 

a) -Dj,new,fc := ®j,kPj,new span(-Dj ine w,fc) is the unestimated part of the newly added subspace for any t € Ij k+x. 

b) Dj !nevl :— Dj,new,o = ^j,o-P)",new span(Pj !new ) is interpreted similarly for any t e Xj,i. 

c) Dj t * t k := $>j t kPj-x- span(Dj ) * i fe) is the unestimated part of the existing subspace for any t € 

d) Dj t := Dj^fl = $j oPj-i- span^j^.fc) is interpreted similarly for any t 6 

e) Notice that £,\o = ||-Dj>ew||2, Cj.fc = ||-Dj,new,fc||2, Cj,* = \\ D jAU- Also, clearly, ||D 3 >,jfe||2 < (?,*■ 
Definition 24: 

QR 

1) Let Dj^ new/ = P,',newPj.new denote its QR decomposition. Here Ej >new is a basis matrix while Pj.„ew is upper triangular. 

2) Let Pj jne w,_L be a basis matrix for the orthogonal complement of span(£y >new ) = span(Z?j new ). To be precise, Ej iBSVI .± 
is a n x (n — Cj inew ) basis matrix that satisfies E'- new j_Ej jmsw = 0. 

3) Using Bj,new and P Jjn ew,_L, define A jtk , A jtk< ±, H jth , //,.;,. and B jik as 



teij, k 

Aj,k,± '■= — J]] EjflBW,± , '&j,oLtLt&j,oEj 1 ng W ,± 

Hj,k'-—— Ej,new'$j,o(etet — Ltet — e t Lt)$>jfiEj tnev/ 
Hj,k,± '■= — ^2 Ej,new,x'®j,a( e t e t ~ L t e t ' — e t L t ')^j t oEj !mWi ± 



a 



B 



i.k'-—— J]] Ej^w^QjoLtL^joEj^ew — — Ej, neWi ±^jfi{Lt ~ e t )(Lt — e t r )^jfiEj^ ev/ 
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4) Define 



Aj k 



-^7, new -&j,new,A- 



A 3,k o 

A jtk ,x_ 
Hj.k Bj t k 
Bj,k Hj,k,± 



E, 



j,new,_L 

J? ' 

J-J'i. new 



E 



j,new,_L 



5) From the above, it is easy to see that 



6) Recall from Algorithm [Qthat Aj^+H 



EVD 



P 



"A fc 






_ A fe , ± _ 




P' 

j,new,fc,_L 



is the EVD of .4 



-H 



Here P/ ine w,fc is a n x Cj. new basis matrix. 
7) Using the above, .4.^ + 'Hj.fc can be decomposed in two ways as follows: 



A 3 , k +Uj, k 



P, 



"A fc 




pi 

j,new,A: 


. 




pi 

j,new,fc,_L_ 



F, 



new &j,new,A- 



B j,k 



Aj,k + Fj.fc 

Pj,fc Aj ; k,x + Hj,k,± 



E; 



E 



j,new,_L 



Remark 25: Thus, from the above definition, Hj^ = — [$o St(~^< e t ~ e t-^t + e t e t)^o + F + F'] where 
F := S neWi _L^ ew ± $o St -^i-^i^O-^new^ew = ^mw,x-^w,±(-^*,fc-i a i,*)(^*,fc-iOt,* + Dnew,k-i a t,newy E nev/ E' new . Since 
E[a M < new ]=0, ||iF|| 2 < r 2 C 2 A+ w.h.p. 
Definition 26: In the sequel, we let 
1) r := r + ( J - l)c mx and c 



2) ^S,* 



P P 

J j,new J j, new 



ma: 

lliaXj K s (Jj'_i), /^s.new : = 

OPj t new,Jfc)» £fc := max., g j)k , 



HlQJCj K s \Pj t new)i ^s,k 



new,fc ) 



K s,fc 



maxj k s ((I 



3) k. 



2s * := 0.3, k,2 S new := 0.15, := 0.15, kJs := 0-15 and g + := V2 are the upper bounds assumed in Theorem IT~8l on 

maxj K2s(Qj,new,fc) and max., maxfc g^fe respectively. 

4) <j) + := 1.1735 is the upper bound on </>fc that follows using the above bounds (see Fact |29l), 

5) C+* := r„C + (j - IK 

6) 7new,fc := min(1.2 fe_1 7new,7*) (reecall that the theorem assumes max 3 max j . + ( t _ 1)a < Kf ||at, ne wlU < 7new,fc) 

7) Pj * := and i\* := Pj_i (the point of doing this becomes clear in the next remark). 

Remark 27: Notice that the subscript j always appears as the first subscript, while k is the last one. At many places in this 
paper, we remove the subscript j for simplicity. Whenever there is only one subscript, it refers to the value of k, e.g., <E>o 
refers to &j,o, P ne w,fc refers to Pj, n ew,k- Also, P* := Pj-i and P := Pj-%. 



B. Proof Outline for Theorem [7S] 

The proof of the theorem is an easy consequence of the following lemmas. 

1) In Lemma l28l we use Lemma [T6l to bound the RIC for the CS measurement matrices, i.e. we bound 8 s {^jfi) and 

5 s (&j,k), in terms of the denseness coefficients K s (Pj-i) and re s (Pj,new) and the subspace errors (j,* and Q t k. 

2) Let the bound on be = {tq + (j — l)c)£ and that on Q,k-i be Ck-i f° r a ^ 3- 

3) In Lemma [30l assuming that < £f t £j',fc-i < Ck-i> C£-i — 0.6 fc_1 + 0.4c£ and the first three conditions of the 
theorem hold, we show the following for all t G Ij,k< k = 1,. . . (K + 1). 

a) We bound \\PtW2 m terms of Cj,k-l and 

b) Next, we show that ||ft||2 < £ (with £ chosen as given in the theorem). We use this, Lemma |281 and Theorem [5] 
(CS result) to bound the CS error \\S t , C s - Sth- 

c) Next, we show that if the support estimation threshold ui is chosen as given in the theorem, then T t = T t . 
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d) With f t — If, we are able to give an exact expression for the LS step error, e t := St — St and also bound it. Recall 
that et is also equal to L t — ft- 

4) In Lemma [3T1 we use the sin6> theorem and Weyl's theorem (Theorems |7] and [8) to bound the subspace error Q } k for 
projection PCA done at t = t j + ka — 1 in terms of the perturbation matrix, Hj,k, and the various components of the 
decomposition of Aj k given in Definition l24l 

5) Let T e j k denote the event that (i) &,* < r C, Cj',k> < Cj£ for al l 1 < / < 3~h <k' <K, and Q tk > < (+, for all < 
k' < k, and (ii) f t = T t and e t satisfies © for all t <tj + ka~ 1. Under the assumption that ( k < 0.6 fc + 0.4c£, with 
K defined as in the theorem, it is clear that Cj'.K < Ck — C C- Thus, Tj k implies that Q it < y„ = (ro + (j — l)c)C 
(this is easy to see using Remark I221. 

6) In Lemmas [33] and [34l under the assumption that Ck—i < 0.6 fc_1 + 0Ac( and the conditions of the theorem hold, we 
obtain high probability bounds on the various terms in the bound on Q t k from Lemma |3T1 conditioned on T e - k _ 1 . 

a) These lemmas first use Lemma |30l to show that T t = T t and thus e f has an exact expression given by © and then 
apply the matrix Hoeffding inequality (Corollary 1131 or Corollary FBI. Lemma [TOl and Fact |29l are used to obtain 
the final expressions for the bounds and the probabilities with which they hold. 

b) A by-product is the following conclusion. Conditioned on k _ 1 , the event that T t = T t and et satisfies (© for all 
t £ Ij t k holds with probability one. 

7) In Lemma[35l under the assumption that Ct-i — 0.6 fc_1 + 0.4cC and the four conditions of the theorem hold, we combine 
the bound of Lemma [5T| with the bounds on its individual terms from Lemmas [33] and [34] to obtain a high probability 
upper bound on Q^, conditioned on r| fc _ 1 . The obtained bound is a function of Ck—v Cj* anc * °^ ^ e r,ounc ^ s on 
K s (Dj,Dem,k) and on gj t k- We use this upper bound to define ( k in Definition [361 

8) In Lemma [37] assuming that the four conditions of the theorem hold, we show that as defined in Definition [36] 
decreases with k and that it indeed satisfies < 0.6 k + OAcC, for all k < K. 

9) Lemma |40] combines the results of Lemma [35] and Lemma [37] It shows that just under the assumptions of the theorem, 
given rj k _ v the event that Q,k < < 0.6 fc + 0.4cC and that f t = T t and e t satisfies © for all t e T^ k holds with a 
certain probability that depends on a and (. 

The proof of the theorem follows easily by applying Lemma |40]for each j and k and finally using Lemma [37] and the definition 
of K. In the end, we use the definition of a a dd and a > a a dd to show that the the result holds w.p. at least 1 — ?i~ 10 . Thus, 
for large enough n, the result holds w.h.p. 

C. The reason for exponential decay of the bound on Q t k 
First note that Q t o < 1, So we let = 1, Assuming that the four conditions of the theorem hold and assuming Ct—i — 

e 

3,k- 

C<ff + C,t 1 +g , (C + ) 2 / + 0-125cC 
1 - Cn+g+Ct, - (C + 1)(C* + ) 2 / - 0.25cC 
w.h.p. The constants C, C are small but more than one (see Definition l36l l. Also < means that the RHS contains only the 
dominant terms. We define the series fjjjT as — 1 and Q k being the RHS of the bound on Q^. Thus, 

, Cn+g+ . C ' (ro+(j - 1)c)2/ C + 0.125 

^ 1 - Cn+g+Cti - (C + 1)(C* + ) 2 / - 0.25cC^- X 1 - CK+g+Ct- X - (C + 1)(C* + ) 2 / - 0.25cC S 



0.6 fe 1 + 0.4c£, Lemma [35] shows that, conditioned on T e - k _ 1 



In Lemma [37] we use the facts that ( < 10" 4 /r 2 and C < -°°° 15 and the values of «+, g+ to show that Ci + < 0.5985 < 
1 = Co • Next, we use this and the fact that C, k is an increasing function of ( k _ 1 to show that Q" < Ct-i — 0-5985 for all 
jfe > 1. Finally, we use C < a °°° 15 , Ck-i < 0.5985 and the values of k+ and g+ in © to show that < 0.6Cfc_i + 0.16cC 
This gives us £ k < 0.6 fc + 0.4cC- Combined with Lemma [331 this means that, conditioned on k _ lt < O.Q k + 0Ac( 
w.h.p. Thus, within K — r^f^Tj^l iterations, Q k < cC,. Combined with Q < (f = (r + (j — l)c)C, this means that the 
subspace error at any t € lj.K+i satisfies SE( t ) < (ro + jc)£. 
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VI. Key Lemmas - 1: Bounding the RIC, sparse recovery and LS error and subspace estimation error 

At most places in this and the next section, we remove the subscript j for simplicity. Whenever this is done, the convention 
stated in Remark |271 applies. Also recall that P» := Pj-i and P» := Pj-%. 

We first bound the RIC of $fc in terms of the denseness coefficients of P* and P new and their estimation errors. Next, we 
use these to bound the sparse recovery and LS error. Finally, we obtain a bound on the subspace estimation error at the k th 
projection PCA step in terms of the various matrices used in the decomposition of the Ak and Hk given in Definition l24l 



A. Bounding the RIC of 

Lemma 28 (Bounding the RIC of $ fc J: Recall that C* := ||(/ - P*H)P* h- The following hold. 

1) Suppose that a basis matrix P can be split as P = [PijP^] where Pi and P2 are also basis matrices. Then k 2 (P) = 
max T: | T |< s ||/^P||| < k*(I\) + K 2 S {P 2 ). 

2) k2(A)<<, + 2& 

4) ^(*o) = «2(A)<k2,.+2C, 

5) 8 s ($ k ) = K 2 S ([P* Pnew.fc]) < + ^(P„ew,fc) < K*. + K* + (« S ,new + « S ,fcCfe + C*) 2 for * > 1 

Proof: 

1) Since P is a basis matrix, k 2 s (P) = max m < s \\I T 'P\\l Also, \\I T ' P\\% = W[P\, ft] [ft, ftRrlh = W {PiP[ + 
PzPDlrh < \\It'PiP{It\\2 + Ut 'Pz^rh- Thus, the inequality follows. 

2) For any set T with |T| < s, \\I T ' P*\\l = W P*P'jTh = W ' {P*H - P*P* + P*P*')I T h < W(P.H - 
P*P*')I T h + II I t 'P*P*'It II 2 < 2C* + The last inequality follows using Lemma [Tol with P = P* and P = P*. 

3) By Lemma [10] with P = P*, P = P* and Q = P new , ||P ne w'P*||2 < C*- By Lemma [TO] with P = P new and 

P = Pnew.fc, W - P„ewP4w)-Pnew,fc|| 2 - W ~ P,ew,fcP4w,fc) P new || 2- For any Set T with |T| < S, || I T 'Pnew,fe || 2 < 
||7t (/ — Pnewfnew)-fnew,fc||2 + II^T PiewPnew-Pnew,fc || 2 < K Sl fe||(/ — PiewPiew )Piew,fc||2 + ||-^T Piewlb = K Sy k\\(I — 
Pnew,feP n 'ew.fe)-Pnew||2 + 1 1 -^'Piew 1 1 2 < «s,fe 1 1 Aiew.fe 1 1 2 + Hs,k 1 1 -P -P* Piew 1 1 2 + 1 1 ir -Pnew 1 1 2 < Ks,fcCfc + K Sy k(* + ^s,new < 

Ks.feCfc + C* + K s,new Taking max over |T| < s the claim follows. 

4) This follows using Lemma [T6l and the second claim of this lemma. 

5) This follows using Lemma [T6l and the first three claims of this lemma. 



B. Simple facts used at various places 

Let ri~ denote the bound on Q ^ for any j. We obtain an expression for later. 

Fact 29: Suppose K2 S ,* < « + s » = ^-3, «2s,new < K *L new = 0.15, K 2;j ,fc < k + s = 0.15, and n s> u < k+ = 0.15. Pick £ as in 
Theorem [TSI and set £+ = (r + (j — l)c)£. Then, 

D C> < (ro+( ,/ ? 1)c)a /. < VC 

2) C+ < (^TR < 10- 4 

3) C+7 n 2 ew , fc < C + 7 2 < (ro+(J -i )c) . < 1 

4) C* + 7new,fc < C+7* < i #. n < VC 

V r o + (./-l)c 

5) Cf < H^TY < 1-5 x 10- 4 

6) If C*"_ x < O^-^O^cC, then CtiTmw.fc < (0.6- 1.2) fe - 1 7new + 0.4cC7* < 0.72 fc - 1 7new + °; 4 ^ < 0.72 fc - 1 7new + 
0.4VC 

7) If Cti < O.e^ 1 + 0.4cC, then Cif_i7^ < (0-6 ■ l^ 2 )*" 1 ^ + 0.4<7 2 < 0.864 fc - 1 7l 2 ew + (ro+( °jl 1)e)a < 
0.864 fc - 1 7n 2 ew + 0.4 

8) If C* < C + , Cfe < and C < 0.6 fc + 0.4cC, then 
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a) 6 S ($ ) < S 2s ^ ) < k+/ + 2C+ < 0.1 < 0.1479 

b) 6 s ($ k ) < 5 2a ($ k ) < n+J + 2C+ + (4,,new + ~4s,kC + C) 2 < 0.1479 

c) <j>k < n^H) < !- 1735 

Proof: The first seven items follow directly. The eighth item follows using Lemma [28] ■ 

C. Bounding the Sparse Recovery and LS Error 

Lemma 30 (Sparse Recovery and LS Error): Pick £ as given in Theorem [T8l and let £+ := (tq + (j — l)c)C- Let £o> P be 
as defined in Theorem [18] If 

1 ) the first three conditions of Theorem [18] hold, 

2) C* < C + := (r + (j ~ l)c)C and 

3) Cfc-i <(£-i < 0.6 fc - x + 0.4cC 

then for all i e X,,fe, for any 1 < fe < X + 1, 

1) the projection noise B t satisfies ||ft|| 2 < v / c0.72 fe_1 7 ne w + V((Vr + 0.4^2) < Co- 

2) the CS error satisfies \\St, cs — <St||2 < 7£o- 

3) T t = T t 

4) e t satisfies 

(8) 

and ||e t || 2 <0.18V^0.72 fe - 1 7new + 1.2VC(\/^+0.06^). 
Proof: 

1) For f G X/.fc, ft := (/ - P{t-i)P[ t -i)) L * = D *,k-l<H,* + Aiew,fc-ia t , n ew Thus, ||ft|| 2 < C*V^1* + Cfc-iVc7new,fc < 
v / c0.72' c_1 7 new + VClv 7 ^ + OAy/c) < £,q. The second last inequality follows using Fact|29l 

2) By Fact [29] and condition 2 of the theorem, S 2s {^k-i) < 0.15 < \/2 - 1. Given [Tt] < s, ||/3 t || 2 < Co = £ and 

< V2 - 1, by Theorem[5] the CS error satisfies ||5 i)CS - £ t || 2 < J^S^X- I)^ < 7f . 

3) Using the above and the definition of p, ||St,cs ~ St\\oo < 7p£o- Since mint |(<St):r t | > <Smin and {St)rf = 0, 
rnin t |(£t, cs ) Tt | > S min ~ 7p£, and min t |(S t)CS ) T e| < 7p£ . If cj < SU, - 7p£ , then f t D T t . On the 
other hand, if cj > 7p£o, then T t C T t . Since S m i n > 14p£o(condition 3 of the theorem) and to satisfies 
7p£o < w < S'min — 7p£o (condition 1 of the theorem), then the support of St is exactly recovered, i.e. T t =T t . 

4) Given f t = T t , the LS estimate of S t satisfies (S t ) Tt = [^k-^Vt = [(*fc-i)r t ] t (*fc-i5 t + $ fc _iL t ) and (£ t ) T c = 
fori G Xj tk . Also, (^fc^r/^! = I Tt '®k-i (this follows since ($ fc _i) Tl = ^k-il% and $k-i $ ft-l = $ k -i)- Using 
this, the LS error e t := Sj — St satisfies ([8]). Thus, using Fact |29l and condition 2 of the theorem, 1 1 || 2 < <f> + (£+ y/rj* + 
K.,Jk-iCfc- 1 V^7b e w,fc < 1.2(V?VC + V^0.15(0.72) fe - 1 + ^0.06V^) = 0.18^0.72 fe - 1 7ne w + 1.2^(^ + 0.06^). 



D. Bounding the subspace estimation error 

The following lemma is a consequence of Weyl's theorem (Theorem [HJ and the sin# theorem (Theorem [7} 

Lemma 31: If X min (A k ) - \\A kt ±\\ 2 - \\Hkh > °> then 

> < ll^fclh < ll^fcb 

U " A min L4 fe ) - ||^ fci± || 2 - \\-Hkh ~ A min (A fc ) - ||A fcl± || a - ||W fc || 2 

where lZ k '■= HkE aem and A k , A k ,x> T~Lk are defined in Definition [24] 

Proof: Since \ min {A k ) - \\ A k> ±\\ 2 - \\Hk\U > 0, so A min (A fc ) > 1 1 ^4_a.-,_l 1 1 2 - Since A k is of size c new x c new and A min (A fe ) > 

||^4fc,jj2, A ClKW+ i(^lfe) = ||A fe) _L|| 2 . By definition of EVD, and since A k is a c new xc new matrix, A max (Afe,x) = A Cncw+ i(A+'Hfc)- 

By Weyl's theorem (Theorem D, A Cnow+ i(A +Kfe) < A Cnew+ i(A) + ||H fc || 2 = Mfe.±|| 2 + \\n k \\ 2 . Therefore, A max (A fe , x ) < 
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H^fc.xIb + U'Wjfella and hence X min (A k ) - A max (A fc ,j_) > A m i n (^4fe)-||A fe) x||2-||^fe||2 > 0. Apply the sinfl theorem (Theorem 
O with A min (A fc ) - A majc (A fc ,x) > 0, we get 

II /t 5 p/ ^p II <• \\ n kh ||K fe || 2 



Amin(^lfc) — A max (Afc ! j_) X m i a [A k ) — ||^4.fe,x||2 — H'Wfclh 

Since Ck = ~ P^kPi^Dnmh = W ~ P^.k^JKmRvmh ^ IK 7 " Aew.fc-P^w.J^wlk the result follows. The 
last inequality follows because 1 1 i? ne w 1 1 2 = H-^newAiewlh < 1. ■ 



VII. Key Lemmas - 2: Showing high probability exponential decay of the subspace error 

At most places in this section, we remove the subscript j for ease of notation. We retain it where needed, e.g. in defining 
the r.v. Xj k and in defining and using the set Tj k or for the time interval Ij.k- Also, recall that P„ := Pj-i and P„ := 

In this section, in Lemmas [331 and [341 under the assumption that ti—i — 0-6 fc_1 + 0.4cC and the four conditions of Theorem 
[l8]hold, we obtain high probability bounds on each of the terms of (O, conditioned on k l . The proofs of Lemmas [33] and 
[34] are given in Sec. I VH-Cl and [VH-D1 and Sec. IVII-Bl contains facts used in the proofs. Under the same assumptions, Lemma 
[35] combines the result of these two lemmas with to obtain a high probability upper bound on Q k conditioned on k _ v 
We use this upper bound to define in Definition [36] In Lemma [37] we show that, under the assumptions of Theorem [18] 
this Q" indeed satisfies Q~ < 0.6'° + 0.4c£. Lemma l40l then combines the results of Lemmas 1351 and [371 to finally conclude 
that just under the assumptions of Theorem [T8] Cfc < 0-6 fe + 0.4c£ w.h.p. This, along with < C*~> implies that the subspace 
error decays exponentially towards a constant times £ w.h.p. 



A. Obtaining high probability bounds on Q tk 

Recall that k+ ,* := 0.3 and k+ new = 0.15, k+ = 0.15, k+ = 0.15 and g+ = y/2 and 0+ = 1.1735 < 1.2. 
Definition 32: Define the following functions (we will see their utility in the lemmas that follow): 

C{x; u) := (1 + -J^L=) K +^ + (1 + -^L=)( K +)^+) V 



u 



0{u, v) := —(1 + <f>+ + -== + {4> + Y + 4 \ = ) 

J V 1 — vt v 1 — u 

ginc(x; u, v, w) := C(x; u)g + + 0(u, v)f + 0.125w 
gdec{x; u, v, w) :— 1 - u 2 - uv - 0.125w - g ln c(x; u, v, w) 

, , » ginc(x;u,v,w) 

fine{x; U, V, W) := r 

gdec{x;u, v, w) 

As we will see in the lemmas below, A~ w k ginc(Ck-i> C* > Cj,*f> C C) i s a high probability upper bound on 
ll^felb, A newfc g rfec (C A }l 1 ;C* + ,Cj>/' c C) is a high probability lower bound for A min (A fc ) - X max (A ki ±) - \\H k h and 
finc(Ck-v Gti C,>/> c is a high probability upper bound for 4- 

Lemma 33: Consider t € 2j.fc. Pick £ as given in Theorem IT~8l and let := (ro + (j — l)c)C- Assume that the four conditions 
of Theorem [T8l hold. Also, assume that we are given a series of constants Q", with £q = 1 and < O^ -1 + 0.4c£. Define 
the random variable 

Xj,k '■— [o-l, a 2, ■ ■ ■ O-tj+ka-l]- 

Define the set Tj^k a s follows. 

r Jifc := {X jtk : Ci,* < r C; 0',fe' < C for all / = 1, 2, . . .j - 1, k' = 0, 1, . . . K; < C for all fc' = 0, 1, . . . k } H 
{Xj,k -T t =T t and e t satisfies ® for all i < tj + ka - 1} 

Recall that P= fc denotes the event Xj ifc € T jtk . Assume that P(r^ fc _ 1 ) > for all 1 < k < K + 1. Then, 

1) for all 1 < fc < X, P(A min (A-) > A n - W fe (l - (C+) 2 - f|)l r i, fe -i) > l-P-,*(«,0- 
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2) for all 1 < k < K, P(X max (A k<1 _) < X^Mt) 2 f + fj\ r lk-i) > 1 -Pb{<*,Q where 

P-.*(«.C):=cexp(- g M2 . min(1 . 2 4 fc7 4 w;7 4 ) )+^xp(- 8 . 242 . 42 ) and 

p 6 (a, C) := (n - c) exp(- ^^ ). (10) 

Lemma 34: Under the same settings as Lemma [33] for all k > 1, 

1) P({f t = T t and e f satisfies © for all f G 2" 3 ,fc}|r^ fe _ 1 ) = 1. 

2) P(||«*||a < A- w>fc5i „ c (Cti;C + ,C + /,cC) |q fc _i) > l-p c (a,0 where 

8 .242(0.0324 7n 2 ew + 0.0072 7new + 0.0004) 2 Pl 32 • 24 2 (0.06 7n 2 ew + 0.0006 7new + 0.4) 2 ' 

+nexp(- 



32 • 24 2 (0.186 7 2 ew + 0.00034 7new + 2.3) 2 ; 
Lemma 35: Under the same settings as in Lemma [33] for all k > 1, 

1) If ftfcc(#-i; C* + , C* + /; cC) > 0, then P(C fc < fino(Ct-v C* + , OI^Vi) > p*(o, C) where 

Pfc(a,C) := 1 -Pa,fe(a,C) -Pb(a,C) -p c (a,C) (U) 

Proof: This lemma is an easy consequence of Lemmas [31] [33] and [34] ■ 

Definition 36: Define the series {Cfc + }fe=o.i,2. -- as follows 

C := 1, Ct ■= /mc(Cfe-i + ; C + , Ctf, cC), for fc > 1. (12) 
Using Definition [32] an explicit expression for £jjf is 

Cf = : ^ 2 6 t+! 2 l 5cC no C , : where 5 : = ^ + <ti + <5(4)V(C fc + _i) 2 + C'f(Cf, 



\A-(c + ) 2 Vi-(c« + ) 2 %A-(c, + ) 2 %/i-(c* + ) 2 u ; Vi-(c + ) 2 

B. Key facts for proving Lemmas \33\ and \34\ 

In this and the next two subsections, we use — Y] t to denote — Y] fl=T . 

Lemmas [33] and [34] can be proved using the following facts and Corollaries Qj] and [14] Under the assumptions of these 
lemmas, the following are true. 

1) Recall from the model (Sec lIII-Ab and from condition 3 of Theorem [T8l that (i) a t new and at,* are mutually uncorrected, 

(») IK,*ll2 < V^l*, (iii) for t € l^ k , \\ II 2 < \/c 7 new,fc and || 

0't,*0't, new || 2 < \/cf 7 new,fc 7 *- 

2) Recall that 

a) / := A+/A~ where A+ := max f A max (A t ) and A~ := min 4 A min (A f ) and so A+ w fc < A+, A~ wS , > A" 

b) $ = / - P*Pl, $fc-l = I - P*P: - Aew.fc-1-P4w.fc-1' Atfw,fc-1 = $fc-l-Pnew, £>new = £>new,0 - $0-Pnew Q = 
£ ne „i?new, D * = ®oP*, C* = Cfe-1 = ||-Dnew,fc-l|| with Co = ||Aiew||- 

c) Conditions 2 and 4 of Theorem [T8l imply that K2 S ,* < „ = 0.3 and /«2s,new < ^ts new = 0.15, K2 S .k < «2s = 0.15, 
K s ,fc < k+ = 0.15 and gr^fc < g + = \[2. 

d) The r.v. Xj^-i and the set Yj t k-i are defined in Lemma [331 

3) It is easy tO See that ||$ fc _iP»|| 2 < C*. Co = HAiewlh < 1, $oA iew = ^O-Cnew = A,ew, ||-Rnew|| < 1, 

HCRnew)- 1 !! < 1/v/T - ^, ^„ew,±'A K w = 0, and \\E R J ^ a e t \\ = ||(i?; ew )- 1 A ew $ e t || = ||(i? ne w)" 1 -D new e t || < 

ll(-Rnew)" 1 - D n ew 7 T t ||||et|| < - 

fact that CTi(i?„ew) = <Ti(D new ) 



ll(Kew) 1 - D iew- r T t ||||et|| < , 3 ||et||. The bounds on ||i? n ew|| and || (i? new ) 1 || follows using Lemma [10] and the 

y i-C, 



4) Xj k-i € T^fc-i implies that < Cjf , and < We prove this below. This, in turn, implies that 
a) A m i n (i? new i? new ') > 1 — (C^) 2 - This follows from Lemma[l0]and the fact that (7 m iii(i?new) = a m i n (D n 
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b) ||J T /*fc-lP.||2 < \\*k-xP*h < C* < C Un'D^-xh < K a , k -iCk-l < ^tCk-V 

c) fc _ x := \\[(&k-i)r t X*k-i)T t ]- 1 h = 1-2. This follows from Fact|29] 

5) P({Ti = TJ and et satisfies ([8]l for all t € Zj^j-lX^fc-i) = 1 for all Xj^-x G Ij- k_i. We prove this below. In other 
words, conditioned on Xj^-x, T t — T t and e t satisfies 

w.p. one, for all Xj t k-i € T^fc-i. 

6) The matrices Z? new , Pnew> -Knew, A*, Aew,fc-i> ^fe-i are functions of the r.v. Xj t k-x (defined in Lemma [331. 

a) Thus, all terms that we bound in the proof of Lemma [33] are of the form i Y^tex- k ^* wriere can t> e 
rewritten as either Z t = /i(X Ji/£ _i)a M a^/ 2 (X,,/c-i) or Z t = f 1 {X j . k - 1 )a t . new a' tnev ,f 2 {X j . k - 1 ) or Z t = 
/i(X,- i fe_i)ot,»e4 )Iiew /2(Xj ) fc_i) for some functions and / 2 (.). 

b) Conditioned on X^fc-i, all terms that we bound in the proof of Lemma [34] are also of the above form, whenever 
Xj t k-i € Tj t k-x- This follows using item[5](all terms that we bound in the proof of this lemma contain et). 

7) Xj t k-x is independent of any at.* or at. new for t E X/.fe , and hence the same is true for the matrices Z? new , i? n ew, E new , 
D*, D new k-i, $k-x (which are functions of Xj^-i)- Also, Ot,*'s for different t E Ij t k are mutually independent and 
the same is true for a^new's for t E 1-j.k- 

8) Combining the previous two facts, for Lemma [33] conditioned on Xj t k-i, the Z t 's given in item [6] are mutually 
independent. For Lemma 1341 conditioned on Xj k-x, the Z t 's given in item [6] are mutually independent, whenever 

Xj t k-x g r^fc-i. 

9) The assumption that < O.G*^ 1 + 0.4c£ is combined with Fact [29] to get simple expressions for the probabilities 
with which the bounds hold. 

10) The statement "conditioned on r.v. X, the event £ e holds w.p. one for all X E T" is equivalent to "'P(£ e \X) = 

1, for all X ET". We often use the former statement in our proofs since it is often easier to interpret. 
Proof of item [4] Ck-x < Cjt—x foll° ws from the definition of Tj^-x^ Also, the definition implies that £x,* < r oC and 
Cj',K < Ck f° r a ^ f — J ' — !• Using the definition of If from Theorem [T8l and using the assumption on Cjf, this implies that 
Ci'.K < 0.6 K + 0.4cC < < for all / < (j - 1). Using Remark[22] this imphes that C* < r ( + U ~ l ) c C = Ct- 

Proof of item [5] Xj t k-i € Yj^-x implies that £f.-x < Ck-X C* < C*~ = r o + (j — 1)C This follows using item [4] By 
assumption, Ck-i — 0-6 fe_1 + QAcQ and the four conditions of Theorem [T8l hold. Thus, conditioned on Xj^-i, all conditions 
of Lemma [30l hold as long as Xj^-x G Ij - >fc— !• Applying Lemma [30l (i) T t = T t for all i e %j,k', and (ii) for this duration, 
et satisfies ([8]), i.e. the claim follows. 

C. Proof of Lemma \33[ High probability bounds on A m i n (j4.fc) and X max (Ak t ±) 
Proof: 

In this proof, we frequently refer to items from the previous subsection, i.e. Sec. IVII-BI 

Consider A k := ^Y,t E ^'^o L tL t '^oE mw . Notice that E ne J<& L t = R nev/ a t ^ evj + E m J D*a t ^. Let Z t = 

-^new ( 2'£,new'2'i,riew ^new let Y\ ^new^£,new^£,* ^* -^new ~r~ ^new ^*^i,*^t,new ^new » then 

A fe h - y> t + - (B) 

t t 

Consider Z t = J2 t -R n ew a t,newat,new'-Rnew ( a ) Using item [8] of Sec. IVII-Bl the Z t 's are conditionally independent given 
Xj,k-i- (b) Using item[7] Ostrowoski's theorem (Theorem[9), and item[4] for all Xj^-x € rV,k-i> A ro j n (E(i Z t \Xj^_ 1 )) = 

A m in(-Rnew^ X)t E(at,newat,new')^new') > A m i n (i? n ew-Rnew')^min ( — X) t ^( a « ,newat,new') ) > (1 — (C* ) 2 )\ew,fe- ( C ) Finally, Using 



Thus, applying Corollary [T3l with e = , we get 
P(W£ E > (1 - (C + ) 2 )A MWlfc - ^-l^-i) > 1 - cexp(- ,"^\%i fc , ,J for all .Y,, , e 



items [3] and [TJ conditioned on Xj^-i, ^ Z t X C7,^ ew fc 7 < cmax((1.2) 2fe 7n ew , 7^)/ holds w.p. one for all Xj^-x € r^fe-i. 

_ cCA" 
— 24 



a— 24 1 ^ 8-242.min(1.2«74 w)7 4). 
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Consider Y t = -RnewOt.newat,*' D* E nevl + ^new'f*at,*at,new'-Rnew'- (a) Using item[8] the Y t 's are conditionally independent 
given X jtk -i. (b) Using items and Q] E(± Et *tl^j,fc-i) = for a11 -^j.fe-i € r j,fe-i- ( c ) Usin S items [TJ [3] g] and Fact[29] 
conditioned on Xj t k-i, \\Y t \\ < 2 v / cr('+7 >t 7 n e„.fc < 2y / cr£+7;f < 2 holds w.p. one for all Xj^-i G Tjk-i- Thus, under the 
same conditioning, —6/ ^ with & = 2 w.p. one. Thus, applying Corollary 1131 with e = ct ^ , we get 

P(A min (i^>) > -^1X^0 > l-cexpe ^y^p ) for all X^-i G (15) 

Combining (O, (O and (Q~5]) and using the union bound, P(A min (A fe ) > A~ wfc (l - (Ct) 2 ) ~ ^-l^fe-i) > 1 - 
p a (a, C) for all X^-i G F^fe-i. The first claim of the lemma follows by using A~ w fe > A - and then applying Lemma [TTI 
with X = Xj : k-i and C = Tjk-i. 

Now consider A k ,± := ± Et E mw,-L®o L tLt$oE m! w,x ■ Using item [3] E neWiX '^ L t = E neWt± ' 'D*a ti ,. Thus, A fe . ± = 
i E t Z t with Z t = E nevl ± D*a t *a t y Dj E nevl ± which is of size (n — c) x (n — c). (a) As before, given X,j k _i, the Z t 's 
are independent, (b) Using items [4] [JJ and Fact [29] conditioned on Xj^-i, -< Z t -< r {(,t) 2 l*I d± CI W P- one f° r all 
X jjk -i G Tj-fc.!. (c) Using items 012 E(± Et ^fe-i) ^ (C + ) 2 A+I. 

Thus applying Corollary [13] with e = , we get 

P(A max (A fc a) < (C) 2 A+ + > 1 - (n - c) exp(- QC 8 2C2 2 ( 4 A 2c )2 ) for all € T,- 

The second claim follows using A~ k > A~ and / = A + /A~ in the above expression and then applying Lemma [TTI ■ 



D. Proof of Lemma \34\ High probability bounds on ||Hfc|| 2 

Proof: In this proof, we frequently refer to items from Sec. IVII-BI 
The first claim of the lemma follows using item [5] of Sec. IVII-BI and Lemma [TTI 
For the second claim, using the expression for H k given in Definition [24] it is easy to see that 

||Wfc|| 2 <max{||iTfc|| 2 , \\H k ^\\ 2 } + \\B k \\ 2 < \\- Ve t e t '|| 2 + max(||T2|| 2 , ||T4|| 2 ) + ||B fc || 2 (16) 

a i — d 

where T2 := \ Et E nev ,'$> (L t et' + e t L t ')$> E mv> and T4 := i Et E nev ,,±'^o(L t et' + e t 'L t )$> E mWi± . The second inequality 
follows by using the facts that (i) H k = Tl - T2 where Tl := i Et E nev ,'$oe t et'<f>oE Bew , (ii) H k ^ = T3 - T4 where 
T3 : = ^E t £new,_L'$oe t e t '$ £new._L, and (iii) max(||Tl|| 2 ,||T3|| 2 ) < ||i Et e t e t' Ha- Next, we obtain high probability 
bounds on each of the terms on the RHS of ( fToT ) using the Hoeffding corollaries. 

Consider || — E t etet'|| 2 . Let Z t = e t et ■ (a) Using item [8] conditioned on Xj t k-i, the various Zt's in the summation 
are independent, for all Xj^-i G Tj t k-i. (b) Using items [JJ [2] [4] conditioned on Xj^-u -< Z t di b±I w.p. one for all 
X/,fc-i G r j>fc _i. Here h := (K+Cti<£ + Vc7n e w,fc + Q+4> + ^l*?- (c) Using items [JJ SI d ^ Et E (^l^,fc-i) ^ 
& 2 I, & 2 := (^) 2 (C_ 1 ) 2 (0 + ) 2 A+ W!fe + (C+) 2 (0 + ) 2 A+ for all X j<k -x G I^. 

Thus, applying Corollary [T3l with e = c ^ , 

P(ll^ E e * e *'H2 < fo 2 + ^-|A>, fc _0 > 1 -ncxp(- ^ 2C 2 2 ^ )2 ) for all G r Afc _i (17) 

Consider T2. Let Z t := E^'^oiLtet' + e t Lt)$oE n&v/ which is of size c x c. Then T2 = ^Et 2 '*- ( a ) Usin S item [8] 
conditioned on Xj k-i, the various Zt's used in the summation are mutually independent, for all Xj^-x G Tj^-i. Using items 
[2]and[3] E new '^> L t = R Bev/ a t , aevi + E new ' D*a t .* and E mw '^ e t = (R B tm')~ 1 Dj XSW 'e f . (b) Thus, using items [2] [3] [H [JJ it follows 
that conditioned on Xj tk -i, ||Z t || 2 < 2b 3 < 2b 3 w.p. one for all Xj^-i G T j>k -x. Here, 6 3 := -j=i==(f) + (k+ (£_ iy /cj new M + 

VrCtl^i^ln^.k+VrCtl*) andb 3 ■= T ^ + „ (0 +CK ^ 2 C-l7new fc+<^ + V^ K ^ 2 C_iC + 7new,fc7* +4> + V^C«j"C + 7*7new,fe + 

V 1-(C« ) 2 

4>+rC+\ 2 ). (c) Also, ||^E t E(^l-Xi,*-i)lla < ^ < 2& 4 where 6 4 := "° + + 77=Trl ^ + (C + ) 2 A+ 
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and 64 := - 7 =% ? =0+ K +C+_ 1 A+ w + 1 0+(C+) 2 A+. Thus, applying Corollary d with e = ^1 



'- v v- «c 2 C 2 (A-) 

^3 



P(||T2|| a < 26 4 + \Xj,k-i) > 1 - cex P(- 32 , 242,4^2 ) for a11 e r iifc _! 



Consider T4. Let Z t := E new .±'&o(Lte t ' + e t L t ')$oE nsv/ ,± which is of size (n— c) x (n— c). Then T4 = ^ Et Z t . (a) Using 
item|8] conditioned on Xj t k—i, the various £>t's used in the summation are mutually independent, for all Xj >k -i G Tj <k -x. Using 
items HIE] ^new.x'^oi* = E nevii ±' D+at,*. (b) Thus, conditioned on X^-i, ||Z 4 || 2 < 2& 5 w.p. one for all Xj^-\ G ^j,k-i- 
Here 65 := </> + KC) 2 7* 2 + + V^>4C + C-i7*7new,fc This follows using items H Q] (c) Also, || ± £ t E(^ t |X ijfc _i)|| 2 < 
2& 6 , fe 6 :=0+(C+) 2 A+. 

Applying Corollary [l4l with e - 



24 ' 

< X ~ IV . \ \ 1 ^„„„, «c 2 C 2 (A-) 2 
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P(||T4|| 2 < 26 6 + -1_|A- J , fc _i) > 1 - (n-c)exp(- 32 ^ for all X i>fc _i G r i)fe _j 



Consider max(||T2|| 2 , ||T4|| 2 ). Since 63 > fe 5 (follows because C^_i < 1) and 64 > b 6 , so 2fo 6 + < 2& 4 + and 

ac 2 f 2 ( \ - ) 2 ac 2 c 2 (\ — ) 2 

1 - (n - c) cxp( — 8 . 24 2, 4b a ) > 1 - (n - c) exp( — 8 . 24 2. 4b i )■ Therefore, for all X,-, fe _i G T jtk -i, 

P(||T4|| 2 < 26 4 + > 1 - (n - c) expt- ^f/*"^ ) 



24 x / v / ^ 32-24 2 -46l 



By union bound, for all Xj >k _i G r^-i, 



P(max(||T2|| 2 , ||T4|| 2 ) < 26 4 + ^l^-i) > 1 - n exp(- gf^Ajj ) (18) 

Consider ||-B fc || 2 . Let Z t := -E n ew,_i_'$o(£t - e t )(i t ' - e t ')^oE aevl which is of size (n - c) x c. Then B fc = ~ Et Z t . (a) 
Using item [HJ conditioned on X^ k -\, the various Z t 's used in the summation are mutually independent, for all Xj^-i G 
r^fc-i- Using items |2[3l E Deviy± '$ (Lt ~ e t ) = S ne w,x' (D*<H,* - $oe*)> E n<sv ,'$ Q (L t - et) = R nev ,a t: „ evj + E new 'D»at,» + 
(Kew) -1 - ^ 6 *- ( b ) Thus ' conditioned on Xy ife _i, ||Z t || 2 < 67 w.p. one for all G r j - ifc _ 1 . Here 67 := (V^C + (1 + 

^ + )7, + (K+)C A t 1 + Vc7new,fc)(Vc7new,fc + V^C + (l+ ^^j-^ K t<t> + h* + T^Z^S K t 2 (k-i<t> + V^Ynew.fc)- This follows using 
items 12 E] gl ID (c) Also, ||± Et E(Z t |X/ lfc -i)|| 2 < &8 where b 8 := + -/^S (^) 3 (C- 1 ) 2 (0 + ) 2 )A+ w , fc + 

(C+) 2 (l + 0+ + ^=4— k+^+ + -1_ K +(0+) 2 )A+ for all X j<k ^ G r i)fc _i. Thus, applying Corollary El with e = S^L, 



P(||S fc || a < b s + C ^-\x hk ^) > 1 - nexp(- Q 3 C 2 C 2 ( 4 A 2 j ) for all X^ k _ x G r jjfc _i 

Using ([T6J, (EJ, CI and O and the union bound, for any Xj^—i £ Tj.fc— i? 

, . , ac 2 C 2 (A~) 2 N . ac 2 C 2 (A-)\ , ac 2 C 2 (A-) 2 e 2 , 

P(llm|| 2 < b 9 + -VlX^O > 1 - nex P (-^A^) - nex P (- 32 ^ - nex P (- Al^ , 

where b 9 := & 2 + 26 4 + & 8 = (( 2 / ^l + K ^ + )C fe + _i + ((4) 2 (0 + ) 2 + ^-£2i££)(C_ 1 ) 2 )A n + ew . fc 



(19) 



2(j)+ 



+ 1 + 0+ + - sy + y )(C) 2 A+ 



+((<rr + 

= C(C+_ i; C+)A+ ew>fe + 0(C+, C/)A+ (20) 

where C(x;u, u) and 0(u,v) are defined in Definition [32 Using A~ wJ . > A~ and / := A + /A _ , b 9 + < 
^new fcftnc(Cfel-i) C*"> C*" /> c 0- Using Fact [29] and substituting nf = 0.15, + = 1.2, one can upper bound 61, 63 and 6 7 
and show that the above probability is lower bounded by 1 — p c (a, £). Finally, applying LemmaQTl the result follows. ■ 
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E. Exponential decay of the bounds on Q t k 

Lemma 37 (Exponential decay of Pick ( as given in Theorem IT8l Assume that the four conditions of Theorem [T8l 
hold. Define the series ( k + as in Definition [36] Then, 

1) £+ = l, £+ < < 0.5985 for all k > 1. 

2) < 0.6 fc + 0.4cC for all k > 

3) fl«fac(Cfc I C* + , C* + /, cC) > 3dec(0.596; 10" 4 , 1.5 X lO" 4 , lO" 4 ) > for all k > 1. 

Proof: Conditions 2, 4 of Theorem [T8l imply that k 2;s ,* < Kga , = 0.3, «2s,new < K 2s new = 0.15, K2s,fc < kJ s = 0.15, 
n s ,k < n+ = 0.15 and g. ]M < g+ = y/2. Using Lemma |28l this implies that cj) k < 4> + = 1.1735. Using Fact|29l (+ < 10~ 4 ; 
C + / < 1-5 x 10~ 4 ; and c( < 10~ 4 . 

1) By definition, £q = 1. We prove the first claim by induction. 

. Base case: For k = 1, Ci + = /mc(l; C* + , C* + /, <) < /inc(l; 10~ 4 , 1.5 x lO" 4 , lO" 4 ) < 0.5985 < 1 = C + - 

• Induction step: Assume that Ck—i — (k-2 f° r k > 1. Since fi nc is an increasing function of its arguments, ( k = 

finciCk-l'i C?~i C~/i C C) < finc(Ck-2> C*~i Cf/i c C) = Cfe-1' 

2) For the second claim, let 6 a (x;u,v,w) := ; g ^(^v,w) ^de b (x;u,v,w) := ^^f^j^T^- Then, / mc (x; u, w, w) = 
a (a;; u, u, w)x + 0(,(a;, it, u, w)c£. 

• Notice that 8 a , 9b are also increasing functions of all their arguments. Thus, a (Cfe_ii C*" ; C^-A C C) < 
<9 a (0.5985;10- 4 ,1.5 x lO^lO" 4 ) « 0.4471 < 0.6 and ^(C^_ x ; Cf7> O < ^(0.5985; 10" 4 , 1.5 x 
10~ 4 , 10~ 4 ) = 0.1598 < 0.16. Thus, 

< 0.6^^1" + (0.6 fe ~ 2 + 0.6 fc ~ 3 + • • ■ + l)0.16cC 

Ifirf 

<0.6 fc + - - = 0.6 fc + 0.4cC (21) 

1 — 0.6 

3) Since ( k < 0.5985 and gd ec is a decreasing function of its arguments, gdec(( k ', C~i C*~/> C C) > 5dec(0.5985; 10 -4 , 1.5 x 

io- 4 ,io- 4 ) > 0. 



F. High probability exponential decay of Q t k 

Definition 38: Define the event Tj k for k = 1, 2 . . . K + 1 as 

f {Cj.fe <(£,f t =T t and e t satisfies © for all i G X,- )fc } if 1 < jfe < if 
\ {f t = T t and e { satisfies © for all * € £j, fc } if fc = K + 1 

Remark 39: Recall that the event r| fc is defined in Lemma [33] as follows. 

q fc := {&,♦ < r C; 0'.*' < Ct> for all / = 1, 2, . . . j - 1, k' = 0, 1, . . . K; Q tk , < {+, for all k' = 0, 1, . . . k} n 
{T t = T t and e t satisfies © for all t <tj + ka- 1} 

It is easy to see that q fc = D f| >fc for all 1 < fc < X and r| +1>0 = n f^ +1 . Thus, T* jfk = n ■ • • n f- fc 

r^ +li0 = q n (nf+/q fc ) = rf j0 n n^^n^q,^). 

Lemma 40: Pick ^ as given in Theorem [T8l Let := (r + (j — l)c)C and let be as defined in Definition [36] Also, 
let Pfe(a, C) be as defined in Lemma [35] and let the events 1^ k and fc be as defined above in Definition [38] and Remark [39] 
Assume that the four conditions of Theorem 1X81 hold. Also, assume that V(T e - k _ 1 ) > for all 1 < k < K + 1. Then, 

1) Cfe < 0.6 fc + 0.4cC for all < k < K, 

2) P(f|, fe |r| ifc _ 1 ) > p k {a,() for all 1 < k < K and 

3) p(f u+h,K) = i- 
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Proof: By Lemma[33 denned in Definition[36]satisfies < 0.6 fe +0.4cC for all k < K and g de c(Ck 5 C C/, <) > 0. 
Thus, we can apply Lemma [35] and Lemma [34] By Lemma [35] P(Cfe < tk\ T j,k-i) > Pk(a,()- B Y Lemma [34] P({f t = 
T t and e t satisfies © for all t G 2,-,*}|r| fe _ 1 ) = 1. Combining these two facts, P(f | fe |r^ >fc _ 1 ) > p k (a, Q for all 1 < k < K . 

Since T e j K holds and since (+ < 0.6 k + 0.4cC for all k < K, thus C* < Ct and (k < (k < °-® K + °- 4c C This is proved 
in Sec. IVII-Bl (iteml4l). Using this and applying Lemma [30l the last claim follows. ■ 

VIII. Proof of Theorem|T"81 

1) By the assumption that ||(7 - P P^)P || < r C, P({Ci,* < Ci,*}) = l - B Y Lemma 1301 Ci,* < C^* implies that f t = T t 
for all t tmin < t < t x - 1. Thus, P(rf )0 ) = 1. 

2) Recall that T e jk = n f? fc for all fc > 1 and T e j+lfi = q o n (nf+^f^). Thus, 

P(r?+i,o) = P(q )nfJi 1 P(q fc iq ,q 1 ,...q fc _ 1 ) = P(q )nt + i 1 P(q fe |q fc _ 1 ). Thus, P(r| +li0 ) = 
P(rf, )n^ = int + i 1 P(q, fc |q, fc _ 1 ). 

3) Since P(rf ) = 1 > and pk(a, C) > for all k, we can apply Lemma l40l for every k and f starting with k = = 1. 
Thus, by Lemma l40l P(T e r i 1 ) > (Y[^=iPk(a,C)) J ^ (pk((x, Q) KJ ■ The last inequality follows because pk > Pk- 

4) Now, 

a) T e J+1 implies that (i) f t = T t and e t satisfies (O for all t < t J+1 \ (ii) Q <k < for all k < K, j < J. 

b) By Lemma [37] < 0.6 fc + 0Ac(. Thus, T} +1 implies that Ci,* < r ( and Q,k < 0.6 fc + 0.4cC for all j < J, 
k < K. Using the definition of K, this means that Q : k < c( for all j. By Remark [22] all this implies that for 
t G X j>k , SE t < (,> + 0,fe-i < (no + (j - l)c)C + 0.4cC + 0.6 fe - 1 , and for t e K+1 , SE t < SE jiK < (r +jc)(. 

c) Combining the previous two conclusions and using Fact [29] E e J+1 implies that the bounds on || 1 1 2 hold. 

5) Since P(Tj +1 ) > (pK(a, C)) KJ > a U °f the above hold w.p. at least {px{a, Q) KJ ■ Using the definition of a a dd, 
(pk (a, C))^' 7 ^ 1 ~ n -10 whenever a > a at jd- Thus the above conclusions hold w.p. at least 1 — nT 10 . 

IX. Extensions 

In Sec IIX-AI and Sec IIX-BI we show how ReProCS and its performance guarantees can be extended to the (a) the missing 
data (measurements) case and (b) the undersampled projected measurements case. In Sec IIX-CI we introduce a more general 
subspace change model and give the performance guarantees for it. In Sec IIX-D1 we describe the key idea of ReProCS with 
deletion which will improve performance for data satisfying the more general model of Sec IIX-CI 

A. Extension to Missing Data Case 

Consider the following problem. The goal is to recover a sequence of L t 's that lie a slowly changing low dimensional 
subspace from measurements M t , when some of the measurements may be missing. To be precise, let T t denote the set of 
missing measurements at time t. Then, M t is a sparse vector with support T f c , i.e. 

(M t ) T c = (i t ) T c, (M t ) Tt =0 

and L t follows the model of Sec IIII-AI We do not assume any model on T t , except an upper bound on its size. In particular, 
it could be correlated over time. This problem is related to the low rank matrix completion problem ||32l . II331 . 

As explained in earlier work [5 1, any solution for low dimensional signal recovery in the presence of sparse outliers can be 
converted into a solution for low rank matrix completion by re-formulating the problem as M t = L t + St where (St)T t c = 
and on Tt, St can take on any value. Since we let (Mt)r t = 0, this means that we are letting (St)T t = — (it)r t - Notice that 
this missing data problem is actually easier because Tt is perfectly known. 

To recover L t from missing data, we can use the ReProCS algorithm of Algorithm [1] with the following simple change: 
remove steps lb and lc and replace them by T t = T t . 

The performance guarantees given in Theorem [18] also apply to this case with the following simple changes. In condition 1, 
the parameters £ and uj are not needed; and in condition 3, one can remove the lower bound on S m i a - Also, the conclusions 
for recovering L t , i.e. the bound on \\L t — Lt\\2 are the only ones relevant in that case. 
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Algorithm 2 ReProCS for undersampled measurements 
Implement Algorithm Q] with the following changes. 

1) In step lb (sparse recovery step), replace $>n) by $(j)A 

2) In step Id (LS step), replace $( t ) by $( t )A 

3) In step 2 (estimate Lt step), estimate as Lt = Mt — ASt- 



B. Extension to the Under-sampled Measurements Case 

In Theorem [18] we obtained performance guarantees for recovering a sparse vector from full, but highly corrupted 
measurements of the form M t := St + L t where St is the sparse vector and L t is the large but low dimensional "noise". 
Alternatively, if Lt is the low dimensional signal of interest (robust PCA problem), then Theorem [18] gives performance 
guarantees for recovering it from large but sparse "noise" (outlier), St- We showed that both St and Lt can be accurately 
recovered using ReProCS (Algorithm Q3 under mild assumptions. 

In this section, we consider the under-sampled measurements case, i.e. we would like to recover St from 

M t := AS t + L t 

where A is a given fat measurement matrix. To recover St and L t , we can use the modification of ReProCS given below 
in Algorithm [2] We should point out that, as far as recovering St is concerned, the same algorithm will also extend to the 
situation where M t = ASt + AL t . The reason is that if L t lies in a slowly changing low dimensional subspace, then the same 
is true for L t — AL t . The difference will be that we will only be able to recover L t , but not the actual L t . 

To obtain performance guarantees for recovering St and L t from Mt := ASt + Lt using Algorithm [2] we can proceed in 
a fashion analogous to what was done for the full sampled case. The most important difference in this case is to replace the 
denseness coefficient n s (B) by the following 

,m \\At'B\\ 2 

T:\T\<s \\B\\ 2 

Notice that It in the definition of k s {B) has got replaced by At now. Also k s>a (B) is no longer just a denseness coefficient. 
It should now be interpreted as a measure of coherence between the span of any At with |T| < s and span(B). Using this, 
we can get the following result. 

Corollary 41: Consider Algorithm [2] for recovering St and L t from M t := ASt + L t . Assume that L t obeys the model 
given in Sec. IIII-AI and there are a total of J change times. Assume that the initial subspace estimate is accurate enough, i.e. 
— P Pq)P || < r oC with C chosen as in Theorem [T8l Define K(£), £o , p, a a dd(C) as m Theorem [T8l If 

1) algorithm parameters are set as £ = £oj 7p£ < lo < 5 m j n — 7p£, K = K(Q, a > a a dd,A = 1.05aadd(C)> 

2) the matrices A, Pj, P j: „ evj , D jitt)k := (I - P^ 1 P^_ 1 - Pj.new.fc-Pj'new.fc)-^'-!- ^.new.fc := (I - Pj-xP'j-x - 

Pj,neMj,kP'j, ns:V j,k)Pj,nevi ^nd Qj im w,k '•= {I — Pj,nev/Pj,aew )Pj,aew,k (recall that Pj^ew.O = [■]) Satisfy 

6 S (A) < 0.05, 

K2s,A(Pj-l) < K+ A , maXK2s,A{Pj,new) < K+ wA , 
3 

max max K2 S ,A(-Dj,new,fc) < max max n 2s ,A{Qj.^,k) < H^ A , 

3 0<k<K j 0<k<K 

max max K2s a(Dj * k) < 1 (22) 

j 0<k<K 

for constants A , K^ ev/ A , A , A small enough, 

3) for a given value of S m [ n , the subspace change is slow enough, i.e. condition 3 of Theorem [T8l holds, 

4) the condition number of the average covariance matrix of at >ne w 1S small enough, i.e. condition 4 of Theorem [TBI holds, 
then all conclusions of Theorem[T8lfor T t , e t := St — St and SE/ t ) hold. The bound on \\L t — Lt\\2 is multiplied by (1 + S S (A)). 
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The changes in the above result w.r.t. Theorem [18] are in condition 1 and condition 2. In condition 1, a a dd.A is slightly larger 
than a al jd used in Theorem [T8l In condition 2, (i) we need to bound S S (A); (ii) we need to assume ft2s,A(Pj,*,fc) < 1 and (iii) 
the values of K* A , k^, w a , kJ a , kJ a need to be smaller than the bounds used in Theorem [T8l 

We explain why these changes are needed in Appendix iGl 

C. More General Model for L t 

The model given in Sec IIII-Al only allows for new directions to get added to Pj-i, but not for any directions to get removed. 
This means that the rank of the subspace in which Lt lies keeps increasing. However, in practice, this may not happen. Consider 
the following more general model. Assume the model of Sec IIII-Al with the following changes. 

1) We assume that Pj = [Pj-i P j, new] \Pj, old where Pj.oid contains e,- i0 id columns of Pj-i- Thus Tj = r,_i + Cj iIlew — c J)0 id- 

2) Assume that there exists a constant c mx such that < Cj. ne w < c mx , and X)i=i( c i,new — Ci.oid) < c mx . This ensures that 
the subspace rank in any period, rj := rank(P i ) = r + X/Li( c *,new - Q.oid) < r + c mx := r mx . 

Even for this more general model, the ReProCS algorithm of Algorithm[T]works. The only difference is that Pj is now actually 
an estimate of [Po, Pi.new, ■ • ■ Pj,new] and not of just Pj. However, since Pj is a sub-matrix of this bigger matrix, thus, span(Pj) 
still approximately contains span(Pj) in the sense defined in Definition [2] Similary at any time t, Pn) is an estimate of the 
span of a bigger matrix than Puy However, span(PV t )) still approximately contains span(P( t )). 

For the above model, the performance guarantees also remain almost the same. The only change is that we need to bound 
k s ([P , Pi. new, ■ • • P,/-i,new]) instead of k s (Pj_i). The following corollary can be stated. 

Corollary 42: Consider Algorithm Q] Assume that Lt obeys the model given above and there are a total of J change times. 
The result of Theorem [T8l holds with the following change. We also need k s ([Pq, Pi. ne w, ■ • ■ P/-i,new]) < 0.3. 

D. ReProCS with deletion 

Consider the more general subspace change model given in Sec IIX-C1 above. While ReProCS applies directly for this model 
as well, one can further improve its performance by also including a deletion step that we briefly explain here. 

A limitation of ReProCS is that at time f, it obtains an estimate of the subspace spanned by all the columns of Lt 
and projects perpendicular to this subspace in order to approximately nullify L t before solving the li problem to recover 
St- However, according to the more general model, the current L t only lies in span(Pj) which is a smaller subspace than 
span(£ f ) = span([Po, Pi^ew, • ■ • Pj.new])- In other words, rank(Pj) is smaller than rank([Po, Pi inew , • ■ • Pj,nm])> an d so me same 
is true for the denseness coefficients. Thus, if we can get an estimate of only span(Pj), the RIC of $j.k will be smaller, thus 
making the sparse recovery more accurate. This, in turn, will mean that the subspace error will also be smaller. 

One simple way to estimate only span(Pj) is to delete directions as follows. Before the next change time, but after the 
current P-^new nas been accurately estimated, we do a standard PCA step. To be precise, at t = tj + Ka + a^ — 1, we compute 
the EVD of ^ Y^t=t K +Ka M 1 LtL't and retain the tj = r^_i + new — c J id eigenvectors with the largest eigenvalues. Use 
these as the new estimate, Puy It can be shown that as long as /, the maximum condition number of Cov(L t ) for any t, is 
small enough, doing this will give an accurate estimate, Pit), of the current P/ t \ — Pj. Because of this deletion, we will be able 
to relax the denseness requirement significantly. We will only need K s (Pj) < 0.3 instead of k s ([Pq, Pi, neW5 • • ■ P/.new]) < 0.3. 

In OTI . we introduce a generalization of this simple deletion strategy that removes the bound on /, but instead only requires 
that the eigenvalues of the average of Cov(L t ) be sufficiently clustered. Suppose that there are 8 clusters. It achieves this by 
replacing the simple PCA step described above by 9 iterations of projection PCA for clusters. 

X. Model Verification, Practical Parameter Setting and Simulation Experiments 

We first discuss model verification for real data in Sec IX-AI In Sec IX-BI we discuss how to set parameters for ReProCS in 
practice. We describe simulation experiments in Sec IX-CI 
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A. Model Verification for real data 

We experimented with two background image sequence datasets. The first was a video of lake water 
motion. The second was a video of window curtains moving due to the wind. The curtain sequence 
is available at http://home.engineering.iastate.edu/~chenlu/ReProCS/Fig2.mp4 For this sequence, the image 
size was n = 5120 and the number of images, t max = 1755. The lake sequence is available at 
http://home.engineering.iastate.edu/~chenlu/ReProCS/ReProCS.htm (sequence 3). For this sequence, n = 6480 and the 
number of images, i max = 1500. Any given background image sequence will never be exactly low rank, but only 
approximately so. Let the data matrix with its empirical mean subtracted be £/«;/. Thus Cf u u isanx t nmx matrix. We first 
"low-rankified" this dataset by computing the EVD of (l/t maiX )Cf u u£'j u[l ; retaining the 90% eigenvectors' set (i.e. sorting 
eigenvalues in non-increasing order and retaining all eigenvectors until the sum of the corresponding eigenvalues exceeded 
90% of the sum of all eigenvalues); and projecting the dataset into this subspace. To be precise, we computed Pfuii as the 
matrix containing these eigenvectors and we computed the low-rank matrix C = Pf u iiP'f u u£fuii- Thus C is a n x i max 
matrix with rank(£) < min(n, t max ). The curtains dataset is of size 5120 x 1755, but 90% of the energy is contained in 
only 34 directions, i.e. rank(£) = 34. The lake dataset is of size 6480 x 1500 but 90% of the energy is contained in only 14 
directions, i.e. rank(£) = 14. This indicates that both datasets are indeed approximately low rank. 

In practical data, the subspace does not just change as simply as in the model given in Sec. IIII-AI There are also rotations 
of the new and existing eigen-directions at each time which have not been modeled there. Moreover, with just one training 
sequence of a given type, it is not possible to compute Cov(L t ) at each time t. Thus it is not possible to compute the 
delay between subspace change times. The only thing we can do is to assume that there may be a change every d frames, 
and that during these d frames the data is stationary and ergodic, and then estimate Cov(Li) for this period using a time 
average. We proceeded as follows. We took the first set of d frames, := [L 1: L 2 ■ . estimated its covariance matrix 
as (l/d)£i : dC' 1 . d and computed Pq as the 99.99% eigenvectors' set. Also, we stored the lowest retained eigenvalue and called 
it A - . It is assumed that all directions with eigenvalues below A~ are due to noise. Next, we picked the next set of d frames, 
Cd+i-.2d ■= [Ld+i, Ld+2, ■ ■ ■ Lid]; projected them perpendicular to P , i.e. computed d tP = (I— PoPo)Cd+iad; and computed 
-Pi. new as the eigenvectors of (l/d)£i iP £' 1 p with eigenvalues equal to or above A - . Then, P\ = [Pq, Pi, new]- For the third set 
of d frames, we repeated the above procedure, but with Pq replaced by Pi and obtained P%. A similar approach was repeated 
for each batch. 

We used d = 150 for both the datasets. In each case, we computed ro := rank(Po)> an d c mx := maxj rank(Pj new ). For 
each batch of d frames, we also computed a t new := Pj new L t , at,* '■= Pj_ 1 L t and 7* := max t ||at||oo- We got c mx — 3 and 
ro = 8 for the lake sequence and c mx = 5 and ro = 29 for the curtain sequence. Thus the ratio c mx /ro is sufficiently small in 
both cases. In Fig[5J we plot ||at,new||oo/7* for one 150-frame period of the curtain sequence and for three 150-frame change 
periods of the lake sequence. If we take a = 40, we observe that 7 new := maxj max tj .< t<tj+Q ||at inew ||oo = 0.1257* for the 
curtain sequence and 7 new = O.O67* for the lake sequence, i.e. the projection along the new directions is small for the initial 
a frames. Also, clearly, it increases slowly. In fact ||at ne w||oo <• max ( ufc_1 7newi 7*) f° r a H t G Ij.k also holds with v = 1.5 
for the curtain sequence and v = 1.8 for the lake sequence. 

B. Practical Parameter Setting for ReProCS 

The ReProCS algorithm given in Algorithm Q] uses knowledge of tj,r ,Cj Mevj from the model and it has four parameters 
£, u),a, K that can be set in terms of the model parameters as given in Theorem [T8l However, it is unreasonable to expect 
that, in practice, the model parameters are known. We provide here reasonable heuristics for setting both the model and the 
algorithm parameters automatically. For a vector v, we define the 99%-energy set of v as To. 99(11) := {i : \vi\ > wo.99} where 
the threshold uo.99 is the largest value of \vt\ so that 1 1 f Tb se 1 1 i > 0-99||f|||. It is computed by sorting \vi\ in non-increasing 
order of magnitude. One keeps adding elements to To. 99 until ||wt 99 ||2 > 0.99||u|||. 

The complete algorithm is summarized in Algorithm [3] We pick a = 100 arbitrarily. We let £ = £ t and to = u t vary 
with time. Recall that £ t is the upper bound on ||/3t||2- We do not know fit. All we have is an estimate of (3 t from t — 1, 
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Fig. 3. Verification of slow subspace change. The figure is discussed in Sec IX-Al 



$t-x = (I — P(t-i)P! t _i\)Lt—v We used a value a little larger than || x || 2 f° r £t : we let £ t = 2||/3t_i||2. The parameter uj t 

is the support estimation threshold. One reasonable way to pick this is to use a percentage energy threshold of St )OS EH. In 
this work, we used oj t — 0.5(5i jC s)o.99- 

Let Ai, A2, • • • , At lrain denote the eigenvalues of ^— X)t=i L t L t ' . We estimate ro and A~ as 

fn = max (— — ; 1+1 ), A - = Af n (23) 

1=1,2,--- ,ttrain-l X- 

This heuristic relies on the fact that the maximum normalized difference between consecutive eigenvalues is from A~ to zero. 

We split projection PCA into two phases: "detect" and "estimate". In the "detect" phase, we estimate the change time tj 
and the number of new added directions Cj, ne w as follows. We keep doing projection PCA every a frames and looking for 
eigenvalues above A . If there are any eigenvalues above A - , we let = t — a + 1 and we let Cj. new be the number of these 
eigenvalues. Also, we increment j and we reset k to one. At this time, the algorithm enters the "estimate" phase. In this phase, 
we keep doing projection PCA every a frames until the stopping criterion given in step |3(a)iiB| of Algorithm [3] is satisfied 
(this estimates K). The idea is to stop when k exceeds K m - m and Pj new fc Pj jnew is approximately equal to Pj new fc _ 1 Pj, ne w 
three times in a row; or when k = K max . We pick K m i n = 5,K max — 20 arbitrarily. When the stopping criterion is satisfied, 
we let Kj = k and Pj = [Pj-i, Pj,new,ifj]> an d the algorithm enters the "detect" phase. 

C. Simulation Experiments 

1) Data Generation: The simulated data is generated as follows. The measurement matrix M. t '■= [Mi, M2, • • • , M t ] is of 
size 2048 x 5200. It can be decomposed as a sparse matrix St := [5*1,52, ••■ , St] plus a low rank matrix £ f := \L\,L^''- ,L t }. 
The sparse matrix St := [Si, S2, ■ ■ ■ , St] is generated as follows. 

1) For 1 < t < t tiain = 200, S t = 0. 

2) For t tra i n < t < 5200, St has s nonzero elements. The initial support To = {1,2, ...s}. Every A time instants we 
increment the support indices by 1. For example, for t G [ttrain + 1, Strain + A — 1], T t = T , for t G [t trd \ n + A, t tra ; n + 2A — 1]. 
Tt = {2, 3, . . . s + 1} and so on. Thus, the support set changes in a highly correlated fashion over time and this results 
in the matrix St being low rank. The larger the value of A, the smaller will be the rank of St (for t > t t rain + A). 

3) The signs of the nonzero elements of St are ±1 with equal probability and the magnitudes are uniformly distributed 
between 2 and 3. Thus, 5 m j n = 2. 

The low rank matrix £ t := [Li, L%,--- , L t ] where L t := P(t) a t is generated as follows: 

1) There are a total of J = 2 subspace change times, ti = 301 and t% — 2501. Let U be an 2048 x (ro + ci. new + C2, n ew) 
orthonormalized random Gaussian matrix. 

a) For 1 < t < ti — 1, P( t ) = Po has rank ro with Pq = E/ri 2,— ,r l- 

b) For ti < t < t 2 - 1, P( f ) = Pi = [Po Pi,new] has rank r x = r + ci !new with P^ new = U[ ro+1> ... ,r + ci , m ]- 
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Algorithm 3 ReProCS (practical) 
Input: M t , Output: S t , L t , Pu). 

Initialization: Given training sequence [L U L 2 ,--- ,L ttrain ], compute the EVD of Y%*[ UU = EKE' and then 

estimate fo and A - using d23l) . Let Po retain the eigenvectors with the fo largest eigenvalues. 

At t = ttrain> let Pu) «— Po- Let j <— 0, k <— 1, tj = t tra ; n + 1 and flag detect. For t > ttr a ; n , do the following: 

1) Do step 1) of Algorithm Q] but with £ and uj replaced by £ t and u t computed as explained in Sec IX-BI 

2) Do step 2) of Algorithm [T] 

3) Projection PCA: Update Pn\ as follows. 

a) If t = 1, +ka-l, compute EVD of 1 E^^-i)^ " P j-i P i-i) L tm ~ Pj-lPj-i) 

i) If flag — detect, 

A) If no eigenvalues are above A - , then Pu\ <— Pu_iy Increment k <— k + 1. 

B) If there are eigenvalues above A - , then tj 4— t — a + 1, j ■<— j + 1, k 4— 1, /Zag estimate. 

ii) Else if /Zag = estimate, 

A) Let P/.new.fc retain the eigenvectors with eigenvalues above A - , P/ t ) 4— [P,--i Pj, new, fc] and fc •<— k + 1. 

B) If if k > K min and 11 ^«-°+i^'^-'-^"".--\7 P ^"»'-' p ^"».' )I ' t|12 < o.Ol for i = fc-2, fc; or k = if max , 

then ifj 4- fc, Pj -s— [Pj-i P^ new if ] reset fl> a 9 ^~ detect. 
Else (i 7^ + fca — 1) set Pm <- Pt-x)- 

4) Increment t ^ — t + 1 and go to step 1 . 



C) For t > t 2 , P( t ) =Pl = \P\ p2,new] has rank r 2 = n + C 2 ,„ew With P 2j „ ew = ^[r +ci. n , w + l,--- ^o+d^+c^]- 

2) at is independent over t. The various (at)j's are also mutually independent for different i. 
a) For 1 < t < t\, we let (ot)j be uniformly distributed between —7^4 and 7^, where 



7i,t 



400 


if?: = 


1,2,-- 


• ,r /4,Vt, 
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if?: = 




l,ro/4 + 2,... 


,r /2,Vt. 


2 


if i = 


ro/2 4 


l.ro/2 + 2,-.. 


,3r /4,Vt 


1 


if i = 


3r /4 


4-1,^/4 + 2, 


• • ,ro, Vt 
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b) For ti < i < i 2 , a^* is an ro length vector, at, new is a ci, ne „ length vector and L t := P(t) a t = Pi a* = 
P)Ot,* + P.,newCft,new (ctt,*)i is uniformly distributed between —74,4 and 7^4 and a t ne „ is uniformly distributed 
between — 7 ri ,t and 7 ri ,t, where 



7n,t 



l.l fe_1 if ti + (fc- l)a<i <tx+ka- l,k = 1,2,3,4, 

l.l 4 " 1 = 1.331 ifi>ii+4a. 



(25) 



c) For t > t 2 , at,* is an n = ro + ci ne „ length vector, a t ne „ is a c 2jne „ length vector and L t := P(t) a t — Pi&t — 
[PopL,new]a*,* + Pi, new Q>t, new Also, (a tj>t )i is uniformly distributed between —7^ and 7^ for i = 1, 2, • • • , ro and 
is uniformly distributed between — 7 ri ,t and 7 ri . t for ? = ro + 1, . . .r±. a t new is uniformly distributed between 
-7r 2 ,t and 7 r2)i , where 



7r 2 ,t 



l.l k_1 if t 2 + - l)a < t < t 2 + ka - 1, k = 1, 2, • • • , 7, 

l.l 7 " 1 = 1.7716 ift>t 2 + 7a. 



(26) 



Thus for the above model, 7* = 400, 7 new = 1, A + = 53333, A = 0.3333 and / 



1.6 x 10 5 . Also, S„ 



2. 
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Fig. 4. ro = 36, s — maxt \T t \ = 20 and A = 10. The times at which PCP is done are marked by red triangles in (b). 



We used £t la ,„ +A/"t rata as the training sequence to estimate Pq. Here 7Vt lrain = [Ni, N2, ■ ■ ■ , AT tlrain ] is i.i.d. random noise 
with each (iVt), uniformly distributed between — 10~ 3 and 10~ 3 . This is done to ensure that span(Po) 7^ span(Po) but only 
approximates it. 

2) Results: For Fig. [4] and Fig. [5] we used s = 20, r = 36 and ci, new = c 2i „ e w = 1. We let A = 10 for Fig. |4] 
and A = 50 for Fig. [5] Because of the correlated support change, the 2048 x t sparse matrix St — [S\,S2,--- , St] is 
rank deficient in either case, e.g. for Fig. S t has rank 29,39,49,259 at t = 300,400,500,2600; for Fig. S t has rank 
21, 23, 25, 67 at t = 300, 400, 500, 2600. We plot the subspace error SE (t) and the normalized error for S t , ^% t ^ 2 averaged 
over 100 Monte Carlo simulations. We also plot the ratio ^ J i[ f r ,' Pj '' K ' w l f^ 2 at the projection PCA times. This serves as a proxy 

II -^j.new, k II 2 

for K s (Dj iDeWt f c ) (which has exponential computational complexity). In fact, in our proofs, we only need this ratio to be small 
at every t = tj + ka — 1. 

We compared against PCP 0. At every t = tj + 4ka, we solved (fTJ with A = 1/ ^/max(n, t) to recover St and £ f . We 
used the estimates of St for the last 4a frames as the final estimates of St- So, the St for t = tj + 1, . . . tj + 4a is obtained 
from PCP done at t = tj + 4a, the St for t = tj + 4a + 1, . . . tj + 8a is obtained from PCP done at t = tj + 8a and so on. 

As can be seen from Fig. |4] the subspace error SE/ t ) of ReProCS decreased exponentially and stabilized after about 4 
projection PCA update steps. The averaged normalized error for St followed a similar trend. ReProCS(practical) performed 
similar to ReProCS but stabilized in about 6 projection PCA update steps. In Fig. [5] where A = 50, the subspace error SE/ t ) 
also decreased but the decrease was a bit slower as compared to Fig. |4] where A = 10. Also, the ratio ^~p^~^~j]~ was now 
larger. Because of the correlated support change, the error of PCP was larger in both cases. The difference in performance 
between ReProCS and PCP is larger when A = 50. 

For Fig. [6] we increased s to 100 and we used A = 10. A larger s results in a larger ^~jjj~^~fjj^ ( anc l larger /^(-D^new.fc))- 
Thus, the rate of decrease of SEm is smaller than that for the previous two figures. The error of St followed a similar trend. 

Finally, if we set A = 00, the ratio was 1 always. As a result, the subspace error and hence the reconstruction 

error of ReProCS did not decrease from its initial value at the subspace change time. For ReProCS, the average error 

52M = 8A X ltr3 ' The error of PCP was also ver y hi 8 h: 52M £ t =° 2 °oi = 0-43. 

We also did one experiment in which we generated T t of size s — 100 uniformly at random from all possible s-size subsets 

of {1, 2, ... n}. Tt at different times t was also generated independently. In this case, the reconstruction error of ReProCS is 

Woo Eti°2°oi l|f j| St j|^ 12 = 2.8472 x 10~ 4 . The error for PCP was 3.5 x 10~ 3 which is also quite small. 

XI. Conclusions and Future Work 

In this work, we studied the recursive (online) robust PCA problem, which can also be interpreted as a problem of recursive 
sparse recovery in the presence of large but structured noise (noise lying in a "slowly changing" low dimensional subspace at 
all times). We introduced a novel solution approach called Recursive Projected CS or ReProCS that is able to recover both the 
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sparse component, St, and low dimensional component, L t , even when the support set of St changes in a correlated fashion 
over time. Under mild assumptions, we showed that, w.h.p., ReProCS can exactly recover the support set of St at all times; 
and the reconstruction errors of both St and L t are upper bounded by a time-invariant and small value at all times. We also 
showed how the algorithm and its guarantees extend to the undersampled measurements' case. 

In ongoing work I13T1 , we are developing and analyzing the ReProCS with deletion approach for the more general model 
given in Sec IIX-C1 Open questions to be addressed in future work include theoretically studying (a) the correlated at's case; 
(b) the connection between the denseness of Dj i0eWt i e and the support change of St (that we observe in simulations); and (c) 
bounding the sparse recovery error even when the support set is not exactly recovered. In the current work, we needed T t = T t 
for t < tj + ka — 1 to ensure that the et's for t g Xj^ are conditionally independent given Xj : k-i = [ffli, a 2 , ■ ■ ■ <h-+(k-i)a-i]- 

Appendix 

A. Proof of Lemma \W\ 

Proof: Because P, Q and P are basis matrix, P'P = I, Q'Q = I and P'P = I. 
1) Using P'P = I and ||M||| = \\MM'\\ 2 , \\ (I - PP')PP'\\ 2 = - PP')P\\2- Similarly, ||(7 - PP')PP'\\ 2 = 
|| (7 - PP')P\\ 2 . Let £>i = (/- PP')PP' and let D 2 = (I - PP')PP'. Notice that \\Di\\ 2 = ^/X m ^( D 'i D i) = 
y^D[L\f 2 and ||L> 2 || 2 = y/\ max (D' 2 D 2 ) = ^J\\D' 2 D 2 \\ 2 . So, in order to show ||£>i|| 2 = HA2II2, it suffices to show 
that H-DiDiHz = ||.D 2 .D 2 || 2 . Let P'P S = D UT,V. Then, D[D 1 = P(I - P'PP'P)P' = PU(I - Y?)U'P' and 
D' 2 D 2 = P(I - P'PP'P)P' = PV(I - Y?)V'P' are the compact SVD's of D' l D l and D' 2 D 2 respectively. Therefore, 
= \\D' 2 D 2 \\ 2 = ||/-S 2 || 2 and hence \\{I - PP')PP'\\ 2 = \\{I - PP')PP'\\ 2 . 
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2) \\pp' - PP'h = \\pp - PP'PP' + PP'PP' - PP'h < W - PP')PP'h + W - PP')PP'h = 2 C*- 

3) Since Q'P = 0, <hsn\\Q' P\\ 2 = \\Q'{I - PP')P\\ 2 <\\{I - PP')P\\ 2 = 

4) Let M = (I - PP')Q). Then M'M = Q'(I - PP')Q and so a t {{I - PP')Q) = \J \i(Q'(I - PP')Q). Clearly, 
A max (Q'(/ - PP')Q) < 1. By Weyl's Theorem, A min (Q'(J - PP')Q) > 1 - X max (Q'PP'Q) = 1 - \\Q'P\\j > 1 - C 2 - 
Therefore, y/1 - Q < er 4 ((7 - PP')Q) < 1. 



B. Proof of Lemma 1771 

Proof: It is easy to see that P{B e ,C e ) = E[I B (X, Y)I C (X)\. If E[I B (X,Y)|X] > p for all X £ C, this means that 
E{l B (X,Y)\X]I c (X) > pI c [X). This, in turn, implies that 

P(B e ,C e ) = E[I B (X,Y)I C (X)] = E[E[I B (X,Y)|X]I C (X)] >pE[I c (X)]. 

Recall from Definition g] that P(S e |X) = E[I B (X,Y)|X] and P(C e ) = E[I C (X)]. Thus, we conclude that if P(B e \X) > p 
for all X g C, then P(B e , C e ) > pP(C e ). Using the definition of P(B e \C e ), the claim follows. ■ 



C. Proof of Corollary 1731 
Proof: 

1) Since conditioned on X, the Z t 's are independent, the same is also true for Z t — g(X) for any function of X. Let 
Y t := Zt — E(Z t \X). Thus, conditioned on X, the Y t 's are independent. Also, clearly E(lj|X) = 0. Since for all 
X eC, P(6iJ <Z t < b 2 I\X) = 1, thus 61/ X E(Z t |X) ^ 6 2 / for all X eC. Therefore, P(Y t 2 < (b 2 - &i) 2/ l^0 = 1 
for all X eC. Thus, for Theorem [TJ o- 2 = || E 4 (&2 - &i) 2 ^|| 2 = a(b 2 - h) 2 . For any X £ C, applying TheoremE] 
for {Yt}'s conditioned on X, we get that, for any e > 0, 

P(A max (- YY t ) < e|X) > 1 - nexp(- — ) for all X g C 

a ^— ' 8(o 2 — oi)- 2 

By Weyl's theorem, A max (i Et Y t ) = A max (i E t (^ - V(Z t \X)) > A max (i Et Z t ) + KU^Et ~^(Z t \X)). Since 
Amin(i Et ~E(Z t |X)) = -A max (i Et E(Z t \X)) > -6 4 , thus A max (i Et Y t) > Ama4~ Et Z t ) - 64. Therefore, 

P(A max (- V Z t ) < 64 + e|X) > 1 - nexp(- . ^ ) for all X g C 
a ^ 8(62 -01) 

2) Let F t = E(Z t |X) - Z t . As before, E(Y t \X) = and conditioned on X, the Y t '& are independent and P(Y t 2 ^ 
(62 ^ &i) 2 ^|^) = 1 for a H -X" € C. As before, applying Theorem [T2l we get that for any e > 0, 

P(A max (^ £y t ) < e|X) > 1 - nexp(-^^-^) for all X g C 



By Weyl's theorem, X ma ^J2t Y t) = Amax(£ E t (E(Z t |X) - Z t )) > A mta (i £ t E(Z t |X)) + A max (± E t -Z t ) 
A m in(i Et E(^ t |X)) - A min (i Et Z t) >h- A mi „(i Et ^t) Therefore, for any e > 0, 

P(A min (- V Z t ) > b 3 - e\X) > 1 -nexp(- Q ., Q£2 , ) for all X g C 



35 



D. Proof of Corollary \E\ 

^ M' 



M 



Notice that this is an (m + n 2 ) x (ni +n2) 



Proof: Define the dilation of an ni x n 2 matrix M as dilation(Af) 
Hermitian matrix |[T9l . As shown in |[T9l equation 2.12], 

A max (dilation(M)) = ||dilation(M)|j 2 = ||M|| a (27) 

Thus, the corollary assumptions imply that P(||dilation(Z t )||2 < fo 1 1 J\T ) = 1 for all X € C. Thus, P(— b±I < 
dilation(Zi) ■< biI\X) = 1 for all X e C. Using (O, the corollary assumptions also imply that - E(dilation(Z t )|X) = 
dilation(i ^ t E(Z t |X)) ^ b 2 I for all X € C. Finally, Z t 's conditionally independent given X implies that the same thing 
also holds for dilation(Z t )'s. Thus, applying Corollary [Til for the sequence {dilation(Z t )}, we get that, 

P(A max (i^dilation(Z 4 )) < b 2 + e\X) > 1 - (m + n 2 ) <»P(-|j^) for all X e C 
Using (O, A max (i J2t dilation(Z t )) = A max (dilation (i £ 4 ^t)) = lis St ^l' 2 and this S ives the final result ■ 



E. Proof of Lemma 1761 

Proof: Let A = I — PP' . By definition, 5 8 {A) := max{max| T |< s (A max (Ay J 4T) — 1), max| T |< s (l — A m i n (Ayv4<r)))}. 
Notice that A' T A T = I-I' T PP'I T . Since I' T PP'I T is p.s.d., by Weyl's theorem, X ms , x (A' T A T ) < 1. Since X max (A' T A T )-l < 
while 1 - X miD (A' T A T ) > 0, thus, 

6 S {I - PP') = max(l - A min (7 - I' T PP'I T )) (28) 

|T|<« 

By Definition, k s (P) = max| T |< s ^jp^- = max| T |< s \\r T P\\ 2 . Notice that H-T^PHl = X max {I' T PP' I T ) = 1 - A min (7 - 
I' T PP'I T ) 0, and so 

^(P) = max(l - A mi „(I - I' t PP'It)) (29) 
l T l< s 

From d28j and (|29j, we get 8 S (I - PP') = n 2 s (P). ■ 

F. The need for projection PCA (as in step 3 of Algorithm\Tty 

In this discussion, we remove the subscript j. Also, let P* := Pj-i, P* := Pj-i, n* = rank(P*). 

1) Why projection PCA and not simple PCA: Consider t = tj + ka — 1 when the k th projection PCA or PCA is done. 
Since the error et — L t ~ L t is correlated with L t , the dominant terms in the perturbation matrix seen by PCA are (l/ltj + 
ka)) St=i ° 1 ^< e t anc l i ts transpose, while for projection PCA, they are (l/a)$o J2tei- k ^t e i^0 an d its transposqfl The 
magnitude of L t can be quite large. The magnitude of e< is smaller than a constant times that of L t . The constant is less than 
one but, at t = tj + a — 1, it is not negligible. Thus, the norm of the perturbation seen by PCA at this time may not be small. 
As a result, the bound on the subspace error, SE( t ), obtained by applying the sin 9 theorem may be more than one (and hence 
meaningless since by definition SEm < 1). For projection PCA, because of <£>0i the perturbation is much smaller. 

Let SEfc := SE( tj+ k a ^ = SE( t ) denote the subspace error for t £ Zj,k- In quantitative terms, for PCA, we can show that 
SEi < Cnfg + + C' /£+ for constants C, C' that are more than one but not too large. Here g + is the upper bound on gj ^ 
(condition number of averaged Cov(at jnew )) and it is valid to assume that g + is small enough so that Cnfg + < 1. However, 
/ is the maximum condition number of Cov(a t ) and this can be large. When it is, the second term may not be less than one. 
On the other hand, for projection PCA, we have SE fc < Cfc + C* < Ck + C* wit h C + = r(, and w &K+g + C£_i + C"/(C+) 2 
and Co" = 1 (see Sec ED- Thus SEi < Cn+g+ + C"/(C+) 2 + C + - The first term in this bound is similar to that of PCA, but 

4 This follows because B = I' T PP'I T is a Hermitian matrix. Let B = USU' be its EVD. Since UU' = I, A min (7 - B) = A min (t/(/ - T,)U') = 

A m in(-^ — ^) — 1 — A max (S) = 1 — A max (-B). 

5 If et and Lt were uncorrelated, as is assumed in most work analyzing finite sample PCA, e.g. see [20], and since Lt is zero mean, these terms would be 
close to zero w.h.p. (due to law of large numbers) and the dominant perturbation term in either case would depend only on etej. 
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the second term is much smaller. The third term is negligibly small (under the slow subspace change assumption). Thus, in 
this case, it is easier to ensure that the bound is less than one. 

Moreover, our goal is to show that within a finite delay after a subspace change time, the subspace error decays down from 
one to a value proportional to £. For projection PCA, this can be done because we can separately bound the subspace error 
of the existing subspace, £*, and of the newly added one, and then bound the total subspace error, SE( t ), by + ^ for 
t E Zj,k- Assuming that, by t = tj, is small enough, i.e. < r*( with £ < 0.00015/r 2 /, we can show that within K 
iterations, (k also becomes small enough so that SE( t j < (r* + c)£. However, for PCA, it is not possible to separate the 
subspace error in this fashion. For fc > 1, all we can claim is that SE^ < Cnf / SEfc_i. Since / can be large (larger than 
1/k+), this cannot be used to show that SE/j decreases with fc. 

2) Why multiple iterations to get a final estimate of P^ neK : The reason is again because e f and L t are correlated and so 
the dominant perturbation terms are (l/a) J^t &oLte' t <f>a and its transpose. The magnitude of et is smaller than a constant 
times that of L t . The constant is less than one, but, at the first projection PCA time, it is not of the order of With the first 
projection PCA, we get the first estimate of P new , Pnew,i- Using this in the projected CS steps ensures a smaller e t for the 
second interval and thus a smaller perturbation seen by the second projection PCA. This results in a reduced perturbation seen 
by the second projection PCA step and thus an improved estimate P^new.2- This makes et even smaller for the third interval 
and so on. Within K steps, we can show that it is small enough to ensure that SE( t \ is proportional to (,. 

3) Why not use all ka frames at t = tj + ka— 1: Another possible way to implement projection PCA is to use the past ka 
estimates L t at the k th projection PCA time, t = tj + ka — 1. This may actually result in an improved algorithm. We believe that 
it can also be analyzed using the approaches developed in this paper. However, the analysis will be more complicated. We briefly 
try to explain why. The perturbation seen at t = tj + ka — 1, Tik, will now satisfy Tik ~ (^/(ka)) 2~2k'=i 2~2tei- y ^o{~Lte' t — 
e t L' t + ete' t )$o instead of just being approximately equal to the last (fc' = fc) term. Bounds on each of these terms will hold 
with a different probability. Thus, proving lemmas similar to Lemma [37] and Lemma [40] will be more complicated. 

G. Proof Outline for Corollary [57] 

In the discussion below, we again remove the subscript j and the convention of Remark [27] applies. Also, as before, let 

:= Pj-i and P» := Pj_ x . 

First, the measurement matrix used in the CS step is now $( t )A and so we need to compute or bound its PJC. We do this 
for any basis matrix P in Lemma [43] below. In Lemma l44l we first bound k s ^a{P*) and K s ,A{Pnew.k) in terms of k s a(-F*) 
and K SlJ 4(P n ew) and the estimation erorrs and then use these bounds and Lemma |431 to bound the RIC of the measurement 
matrix used in the CS step. 

Lemma 43: For an n x r matrix P satisfying P'P = /, 

5 S ((I - PP')A) < 5 S (A) + k s , a {P?- 

Proof: The proof strategy is similar to that for Lemma [16] ■ 
Lemma 44: Recall that C* := ||(/ - P*%)P*h and Cfc := IK' - P*% - Aew,fcP n ' ew , fc )Pnew|| 2 - Let Ka , A> . := k», a (P,), 

K s ,A,new = « S ,A(-Pnew). K S ,A,k ■= «a,yi(U ~ ^i,new^i,newO^,new,fe). The following hold. 
1) K s ,a(P) < K S ,A,* + C* 

+ H s ,A,k(k + C* 

3) <*.((/ - P,Pl)A) < 5 S {A) + K StA (P*) 2 < S.(A) + (k s , a ,* + C*) 2 

4) 6 S ((I - P,Pl - Pnew,fcP n ' ew k )A) < S S (A) + ^^(Pnew.fc) 2 < S S (A) + ( Ks,A,new + Ks,A, 

Proof: The proof strategy for proving the first two claims is similar to that used for proving the second claim of Lemma 
|281 The third and fourth claims follow directly using the first and second claims and Lemma |43] ■ 
Because the bound on the RIC now also contains 6 S (A), thus, we need a bound on 5 S (A). Also, to get the same bound as earlier 
on the RIC, we now need tighter bounds on k S) a(P*), Ks,a(P>«w) an( ^ K s.A(Qmw.k)- If these bounds are tight enough, e.g. if 
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KsA p *) < K t,A = °- 22 ' « S ,A(-Pnew) < «£w >i4 = °- 14 and K s ^(Qnew,fc) < «+ A = 0.14, then we will still get 6 a ($ k A) < 0.15 
for all k (which is the same bound as earlier). Thus, we would get the same bound on the CS error and the same condition on 
the support estimation threshold ui to ensure exact support recovery. Thus, the first three claims of Lemma [30] will not change. 

A second change in our algorithm is that in the LS step, $>m is replaced by $>n)A. For t € Ij,k, &(t) — ^fc-i- Thus the 
expression for e< := St — St changes to 



If we proceed to bound this as in the proof of the fourth claim of Lemma [30] we realize that the bound expression remains 
the same except that (i) k s (.) is replaced by k s ,a(-) everywhere, and (ii) we need to assume « S ,,4.(-D*,fe) < 1. Recall that, 
for any matrix B, the denseness coefficient k s (B) < 1. For the full-sampled case, we never explicitly bounded the denseness 
coefficient of K s (D lf ^), but just used the loose bound K s (-D* t fc) < 1. However k S} a(B) may or may not be less than one for 
a given matrix B. Thus, we need to explicitly assume ^^(-D*^) < 1. 

Lemma |3T| which bounds the subspace error remains the same. Since Lemma [33] depends only on components of the matrix 
J2t LtL't, there is no change in it either. 

A third change in the algorithm is that, in step 2, we now estimate L t as L t — M t — ASt- As a result, now 

U-L t = A(S t - S t ) = Ae t = J 4 Tt [($ fc _iA) T /($ fe _ 1 A)T t ] _1 A Tt / [($ fe _ 1 P,)a i ,, + J D ne w,fc-ia t>n ew]- 

Because of this, the perturbation matrix now becomes Hk = — [$o X)t(^* e t^' + AetL' t + Aete' t A')^Q + F + F'} where F :— 
£new,x£Lw,X $ o£ t £i^ $ o£new-Eoew- Notice that \\ Ae th < \\A Tt h\\ e th < V 1 + 6 s (A) \\ e t || 2 . This causes the following 
changes in the bounds and the probabilities in Lemma [34] Firstly, k s (.) is replaced by k s> a{.) everywhere. Secondly, the 
constants bi, bi . . . b% change as follows: b\, 62 get replaced by (1 + 5 s (A))bi, (1 + S s (A))b2', 63, 64 do not change; 65, b§ get 
replaced by y/l + 5 s (A)b 5 and \Jl + 8 s (A)b 6 ; and b-j,b s get replaced by (1 + S s (A))by and (1 + 6 s (A))b$. If k* a is small 
enough (need a value slightly smaller than k+ = 0.15), then we get the same final bound on ||"Hfe||2 as in Lemma [34] The 
probability expressions change due to the changes in b\, 65, 67. There is no change in Lemma [35] except for the change in the 
expression for the probability pk(a,()- Thus the definition of also does not change and consequently there is no change 
in Lemma [37] Thus, Lemma l40l also remains the same. Thus, the bounds on the subspace error SE(t) and the sparse recovery 
error 1 1 ] 1 2 also remain the same. 

As explained above, because of the increases in the constants 61,65,67, the expressions for p ay k(a, C),Pb(ot, (),p c (a, C) 
change. Thus, to ensure that our result holds with the same probability, we will need a larger a a dd,A- In particular a a dd.A > 
(1 + <5 s (A))a a dd suffices. Since 8 S (A) < 0.05, a^.A > 1.05a a dd ensures that this holds. 
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